Statistics for Analytics and Artificial Intelligence
2020-05-03
Chapter 1 Before we Begin
1.1 How to use this book
The text contains theory, general notes, technical notes, Excel functions / R code, and output.
We decided to write this text to balance the theoretical and practical sides of data analysis. As both practitioners and students of the field, we believe the material contained in this book will benefit anyone interested in analytics. For those interested in a deeper understanding, the technical notes provide more detail on what is happening ‘behind the scenes’. At times, we will include extra information in mouse-over text boxes.
Once theory is discussed, we will move on to practice. Each technical chapter contains Excel examples, R code, and corresponding output. An Excel companion document accompanies the book and provides sample solutions; it is available for download with the book. It is worth noting that this text is geared toward the practical application of analytics, and may sometimes simplify technical details for the sake of practicality.
Download: Stats_For_AI_Excel_Companion.xlsx

1.2 Introduction
While the term is out of fashion, ‘statistics’ might be thought of as the process of deriving an understanding from data. The skillset of statistics, understanding data and how to derive insights from it, is one that will serve you well in just about any domain of business. Whether you work hands-on with the tools associated with analytics, artificial intelligence, and related fields, or simply consume their output, understanding how to interpret data will help you better appreciate the strengths and, perhaps more importantly, the weaknesses of the insights such tools produce.
There is a lot to learn about data, and you have a fair bit of work in front of you. In addition, statistics is a complicated topic, and there are a few different ways one can come to understand the field. For some, mathematics is the best route; for others, an intuitive description of the issues is a better approach. There is no “one size fits all” approach, though: you should explore what you can learn from each method rather than assume something won’t work for you.
Experience does tell us that statistical thinking is not natural for humans and that developing an intuitive understanding of statistics will require a commitment on your part. Realistically, you can expect that it will take one to two hours of study in preparation for each hour of in-class time. It will also take an hour or two of practice after each class to ensure the material sticks. While this is a big commitment, understanding statistical concepts will differentiate you in a significant way, and serve you well in your career. And really, have you ever found it easy to learn a skill that differentiates you from the crowd?
Our approach is to focus on the business application of statistics. To this end, we have tried to highlight the use of business related examples, and minimize the mathematical rigor. It is our feeling that there are too few introductory statistics books that truly approach the problem from a business perspective. This is unfortunate because, in our experience, most would-be practitioners are not experts at mathematics. Moreover, since statistics is actually done by computers, advanced mathematical skill is rarely required to ‘do’ statistics. Finally, the practical application of statistics in organizations often violates assumptions, invalidating some of the mathematical purity of the practice, but a mathematically-focused exposition leaves little room and no language to explain how practitioners should deal with these violations.
1.3 Software
In this book, all of the examples, problems, demonstrations, etc. that involve technical work will be done in Excel using standard functions, charting tools, and the Analysis ToolPak. Excel is a very powerful analysis tool, with a wide variety of statistical and data management functions, graphing capabilities, and even an entire programming language, VBA, for those interested in developing applications.
We will include the equivalent R code for much of the technical work, but it is out of scope for the course.
While Excel is an excellent place to start, we would not recommend it for developing industrial strength tools or repeated analysis. Many of the characteristics that make Excel easy to use also make it difficult to enforce data and analytics integrity rules. So, you should plan to learn one or more additional tools beyond Excel. Some likely candidates to consider are R and Python whose low price (free) and growing popularity make them good alternatives.
R and Python are, in our experience, the most popular statistics tools used in the Analytics and AI world. Both have advantages and disadvantages, but they have similar capabilities from a modelling perspective. Neither has the same constraints in terms of data size or computation time as Excel, so as your analysis and data needs grow more complex you will need to migrate your work to one of these languages. Our experience suggests R is easier to learn at first, but it does not provide the same general programming abilities that Python does. From a machine learning perspective, the large community support for Python and packages like TensorFlow and scikit-learn differentiate Python from most other analytics tools. It is worth mentioning that once you learn your first programming language, each successive language becomes easier to learn.
Given that we are using Excel, you will want to install the Analysis ToolPak. This can be done on either Microsoft or Apple operating systems by following the instructions provided in the appendix. Since implementations differ across machines, the instructions we provide may not exactly match your computer, so you may need to contact the IT department at your work or school if they do not suffice.
Chapter 2 Populations and Parameters, Samples, and Statistics
2.1 Preview
This chapter starts with a discussion of samples, populations, and data generating processes (hereafter DGP). Sometimes we will want to communicate the sample’s characteristics to other people – we do that using summary statistics and charts (Chapter 3). Descriptive statistics is important for communication but, generally, we focus on using data to say something about the characteristics of the population the sample came from, or the DGP that created it.
From there we will discuss types of data both in terms of information content (nominal, ordinal, interval and ratio) and time dependencies (cross-sectional, time-series and panel). We will cover these types of data and their definitions in detail later in the chapter. You will see that numbers and data can actually mean very different things and generally require different analysis techniques.
2.2 Samples, Populations and Data Generating Processes
Statistics can be thought of as a formal way to get information from data. Typically, this data comes in the form of a sample, which represents a broader population or data generating process (see Figure 2.1). Unfortunately, we seldom care about the characteristics of the sample itself. What we do care about is the population or data generating process that the sample is supposed to represent. But the sample is the thing we have to work with. The sample, if it is representative of the population, allows us to draw inferences about the population on a smaller scale, especially in instances where you cannot survey the population as a whole. For example, it is difficult to conduct a question-based survey of every Canadian citizen aged 25-34, as it would take significant time and monetary resources. Instead, we can select a sub-set, or sample, of this population to complete the survey.
Populations and samples are both groups of things like people, products, etc., and we typically only care about certain characteristics of those things – like height, earning, employment status, marital status, likelihood to purchase, etc.
When we speak of a population, the characteristics are called ‘parameters’. These characteristics are typically specific, fixed numeric values that are not known. For example, all of the undergraduate students at our university, who started in September of 2018, constitute a population. That population is large but finite. We could find out the age of each student at the start of their first class and whether or not they are male. From this we could calculate the population parameter that reflects the average starting age of students and the proportion of males. These would be specific numbers.
When we have a sample, the characteristics are called ‘statistics’. So, if I were to take a group of 20 new students from our incoming class and ask them their age and whether or not they are male, the results would be sample statistics. The average age in a particular class might be 21.31 years and the number of males might be 9, yielding a proportion of 9/20.
If I drew another sample, the numbers would likely be different, perhaps that sample would have an average age of 20.84 and the number of males might be 12 for a proportion of 12/20. In general, sample statistics are random numbers whose randomness depends on the population parameters, the size of the sample, and potentially other characteristics. With some caveats, we can generally think of the sample statistics as estimates of the population parameters.
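The variability of sample statistics can be seen directly with a small simulation sketch in R (the population values below are invented for illustration): we create a simulated population of student ages, draw many samples of 20, and watch the sample mean change from draw to draw while clustering around the population parameter.

```r
# Simulate a population of student ages (values are hypothetical)
set.seed(42)
population_ages <- rnorm(5000, mean = 21, sd = 2)

# Draw 1,000 samples of size 20 and record each sample's mean
sample_means <- replicate(1000, mean(sample(population_ages, 20)))

# Each sample produces a different statistic...
range(sample_means)

# ...but the statistics cluster around the population parameter
mean(sample_means)
mean(population_ages)
```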
Descriptive statistics is about expressing the information contained in a sample through summary statistics, graphs, charts, etc. In Chapter 3, we will introduce some techniques for this. Inferential statistics is about using the information from a sample to draw conclusions about the population’s parameters. Generally, the larger the sample, the more closely its statistics estimate the parameters of the population from which it was drawn.
The idea of a DGP in this context is a generalization of the idea of a population. In particular, we can think of the DGP as actually creating the population. In some cases, we may have access to the entire population and still not care about its characteristics, but care a lot about the process that created it. Outside of simulations, the DGP is impossible to measure directly, but it can be estimated. In the student example above, the population parameter could be seen as a sample statistic estimating the DGP’s unknowable parameter. This makes sense because the observed population is but one of the many populations the DGP could have generated.
Figure 2.1: DGP to Population to Sample
2.2.1 Proper sampling is more difficult than you think
In this chapter, we want to emphasize two key ideas. The first is that for ‘standard theory’ to apply, our sampling process must reflect the population we actually care about. This is often more difficult than you might expect and can be a problem in practice. The fact that samples may not reflect the population of interest is one of the many sources of what we like to call “the illusion of false confidence”.
For example, you might like to predict how your customers will react to a change in your product. To do so, you ask a random sample of 2,000 of your existing customers how they feel about it. You get a 15% response rate, giving you 300 individuals from which to generate your predictions. The problem is that, while the original random sample reflected the population, the individuals who chose to respond may not.
In survey responses, it is very likely that those who respond are somehow different from the rest of the population – clearly at least insofar as they actually responded when others did not. Some differences are visible (e.g., they may be different from the average in age, employment status, education, etc.) and some are not visible (e.g., dispositional characteristics such as how much they actually like your product or company). To some degree, you can adjust for the visible differences, but the invisible ones are much harder to deal with.
2.3 Sampling techniques
There are many ways of sampling, each with its strengths and weaknesses.
2.3.1 Simple Random Sample
This process randomly selects a fixed number of members of the population where each member has an equal probability of being selected. This system is very simple to use, but it may not be the most cost effective method. For example, if individuals are widely distributed across Canada, it may be cost prohibitive to collect data from all of them. It may also be inefficient from a sample size perspective. If the sample contains distinct groups, particularly of different sizes, a very large sample may be required to ensure that the constituents of those groups are sufficiently well represented within the final sample.
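In R, a simple random sample can be drawn with the built-in `sample` function; the member IDs below are made up for illustration.

```r
set.seed(1)
population_ids <- 1:1000            # hypothetical member IDs

# Draw 50 members without replacement; each member is equally likely
srs <- sample(population_ids, 50)

length(srs)         # 50 members selected
anyDuplicated(srs)  # 0: no member was selected twice
```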
2.3.2 Stratified Sampling
This process divides the population into distinct groups and then samples each group randomly. It ensures that each recognized group is reflected in the resulting sample and therefore offers the possibility of a more representative sample at a given sample size.
Stratified sampling, at least in principle, adds to the complexity of analysis because individual observations no longer carry the same weight in the overall population; rather, each stratum has a potentially different weighting. One must also be able to recognize and detect members of the different groups in order to select and weight on that basis.
For example, in an employee satisfaction survey, one might decide to stratify the sample by role within the organization. A simple stratification in a factory might result in ‘administrative’, ‘custodial’, ‘line’, and ‘supervisory’ roles. It may be that there are hundreds of line workers but only 10 workers in custodial roles. Stratifying would ensure that these roles were adequately sampled where random sampling might miss these groups entirely.
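A sketch of the factory example in R, using made-up role counts: drawing a fixed number from each stratum guarantees that the small custodial group appears in the sample, where a simple random sample might miss it.

```r
set.seed(1)
# Hypothetical factory with very unequal role groups
roles <- rep(c("administrative", "custodial", "line", "supervisory"),
             times = c(40, 10, 400, 30))
employees <- data.frame(id = seq_along(roles), role = roles)

# Draw 5 employees from every stratum
strata <- split(employees, employees$role)
stratified <- do.call(rbind,
                      lapply(strata, function(g) g[sample(nrow(g), 5), ]))

table(stratified$role)   # 5 per role, custodial included
```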
2.3.3 Cluster Sampling
This process groups the population into clusters based on characteristics that are easy to sample, such as location or time. Individual clusters are then selected at random, and the selected clusters are sampled through a simple sampling process.
In the factory example described above, the cluster might be all those people working on a particular shift. It might be far more convenient to sample those people as a group rather than to sample randomly across the entire population of factory workers.
Since membership in the same cluster may have common characteristics, the cluster sample may underestimate the variability in the population, necessitating a larger sample. Still, the reduced cost of sampling this way may make it an efficient way of collecting data.
2.3.4 Convenience Sampling
This process involves drawing a sample from those members of the population that are easiest to sample. While this is likely very easy to implement at low cost, for what should be obvious reasons, it is unlikely to represent the population it is intended to reflect. It is probably best for an unimportant preliminary study, not something that needs to be statistically valid.
For example, if we wanted to know something about people who live in Kingston – say salaries – I could ask a bunch of my friends. Twenty years ago, many of my friends were full-time students and did not earn a lot of money; now many of my friends are professors. A convenience sample of my friends, either then or now, would not reflect the population of Kingston.
2.3.5 Snowball Sampling
This process involves starting with a small, possibly convenient sample, then having the members of that sample recommend other individuals to sample.
While this process may strike you as being obviously flawed (e.g., biased), it may be one of the only methods you can use when population membership is difficult to detect or potential respondents will not provide information without an introduction.
While a purist might swear off such an approach, one awkward truth of statistics is that you can only use the data you can get. We are a bit more sanguine and would suggest that each of these methods has their place – but only if you understand their strengths and weaknesses.
2.3.6 How to choose?
There is no ‘one best’ sampling method. Instead, you should evaluate the sampling methods against your business requirements. In an ideal world, cost and time are not factors and you can choose the most accurate sampling method. In reality, every project has a budget that will constrain your sampling approach and sample size. When we speak of sampling accuracy, we are essentially speaking of minimizing the standard error: the typical difference between a sample statistic and the population characteristic it estimates.
There are, however, still some principles to keep in mind. If you know your population well and are willing to accept the extra complexity, it becomes easier to create a stratified sample, which is often considered one of the more representative techniques. If you are not willing to invest in a stratified approach, a cluster sample or random sample can also perform well.
Sample size is also an important consideration. We can determine how large a sample we need based on the level of error we are willing to make, and our ideal confidence interval around it. We will present the formula here, but you should revisit this after Chapters 3 and 4.
\[n = (\frac{z_\frac{\alpha}{2} \sigma_x}{E})^2 \]
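Here \(z_{\alpha/2}\) is the critical value for the chosen confidence level, \(\sigma_x\) is the (assumed) population standard deviation, and \(E\) is the margin of error you are willing to accept. A worked sketch in R, with illustrative numbers of our own choosing:

```r
# Assumed inputs for illustration: 95% confidence, sigma = 15, margin E = 2
z     <- qnorm(0.975)   # z_{alpha/2} for alpha = 0.05, about 1.96
sigma <- 15
E     <- 2

# Round up, since you cannot sample a fraction of a person
n <- ceiling((z * sigma / E)^2)
n   # 217
```

Note that the required sample size grows with the square of the precision you demand: halving \(E\) quadruples \(n\).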
2.4 Information Content of Data
We also want to understand the characteristics we are measuring and ensure we apply the appropriate tools. Some data allow for mathematical operations that others do not.
Data can be grouped into the following hierarchy, from least to most information content: categorical, ordinal, interval, and ratio.
2.4.1 Categorical Data
Categorical data is data where the numeric values represent category membership. With only one exception that pertains to dummy variables, no mathematical operations can sensibly be performed on the values of categorical data.
For example, if marital status of our respondents were coded as single = 1, married = 2, divorced = 3, and other = 4, there would be no meaning to the average marital status of a sample of respondents. To see why not, note that we could have coded marital status with a different set of numbers, such as: divorced = 1, single = 2, other = 3 and married = 4, which would contain all of the original information but would certainly produce a different average.
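A quick demonstration of this in R (the respondents are invented): the two equally valid codings from the text produce different ‘averages’ from the same information, which is exactly why the average of categorical codes is meaningless, while the counts are not.

```r
status <- c("single", "married", "married", "divorced", "other", "single")

coding_a <- c(single = 1, married = 2, divorced = 3, other = 4)
coding_b <- c(divorced = 1, single = 2, other = 3, married = 4)

mean(coding_a[status])   # one 'average marital status'...
mean(coding_b[status])   # ...and a different one, same information
table(status)            # the counts, by contrast, do not change
```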
Categorical data can be described with counts of observations and visualized using a bar chart, but should not be used in any mathematical operations without referencing another variable (e.g., what is the average income of a married person?).
Dummy variables are a special case of categorical variables. They take on values of only 0 or 1 and are used to define membership in a single category. So one could define:
\[ Male = \begin{cases} 1\: if\: male \\ 0\: if\: otherwise \end{cases}\]
The structure of dummy variables makes certain mathematical operations sensible. For example, you can take an average of the male respondents as the proportion of males in a group.
\[ Proportion\:Male = \frac{1}{ n } \sum_{ i =1}^{ n } Male_{ i } \]
You can take this one step further by noting that the proportion of males in a group is the probability that a randomly selected individual will be male, so the proportion is actually a probability and both of them are an average of the dummy variable.
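Continuing the classroom example from above in R, with the 9-males-out-of-20 sample:

```r
# Dummy variable for the sample of 20 students (9 males, as in the text)
male <- c(rep(1, 9), rep(0, 11))

mean(male)                # 0.45: the average of the dummy...
sum(male) / length(male)  # ...is exactly the proportion of males
```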
You can also use dummy variables in regression and related equations. At the risk of getting ahead of ourselves, we will later discuss models like this one:
\[ Sales= \beta_{0}+\beta_{1}Income+\beta_{M}Male+other\:variables \]
to model a dependent variable of interest, in this case, ‘Sales’ as being explained by a collection of independent variables, including ‘Male’, where we believe males have different purchase patterns from others. Here we will see what the impact of being in the category ‘male’ is.
2.4.2 Ordinal Data
Ordinal data captures meaningful sequence in data, but the differences between the individual values do not have a numerical interpretation. Ordinal data can, for example, capture the level of difficulty of a task as trivial, easy, medium, challenging, or hard with values 1, 2, 3, 4, and 5, respectively.
With ordinal data, there is meaning to sequence, so for the difficulty example, a value of 5 (hard) is further up the scale than 3 (medium), which is further up the scale than 1 (easy), but there is no meaning to the distance between the elements. The change in difficulty, however measured, from easy to medium is not necessarily the same as the move from medium to hard; nor can it be concluded that hard is three times as difficult as easy.
With ordinal data, any coding that maintains sequence preserves all of the information in the ordinal ranking, so the coding could be changed to trivial = 2, easy = 3, medium = 5, challenging = 7, and hard = 11 without changing the order and therefore preserving the original information.
As a result, the median and mode are sensible operations for ordinal data, but the mean (i.e., average) is not. To see why, note that the median and mode will remain the same under any coding that preserves the order, while the mean will change with the coding.
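This can be checked directly in R, using the order-preserving recoding from the text (the ratings themselves are invented):

```r
ratings <- c(1, 2, 2, 3, 3, 3, 4, 5)   # original coding 1..5
recode  <- c(2, 3, 5, 7, 11)           # order-preserving recoding
recoded <- recode[ratings]

median(ratings)   # 3 -> 'medium'
median(recoded)   # 5 -> still 'medium' under the new coding
mean(ratings)     # 2.875
mean(recoded)     # 5.125: the 'average' has no stable meaning
```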
While that all sounds good in theory, ordinal scales are frequently used in five or seven-point Likert scales to capture how much an individual agrees with a statement anchored with statements such as ‘strongly disagree, disagree, neither agree nor disagree, agree, strongly agree’ which could be coded with values 0, 1, 2, 3, and 4 respectively.
To the horror and disgust of statistical purists, the values of Likert scales are often averaged as though the scales represented either interval or ratio data – the two other data types which we will discuss shortly. While technically this is not defined for ordinal data, it can be motivated to some degree with a few assumptions. For example, if we assume the ordinal scale is measuring a latent numeric variable, and the mapping to the scale is an accurate reflection of that variable, then averaging the scale is a proxy for averaging the underlying latent variable.
Imagine that people rate problems as trivial if they can be solved in 0 to 5 minutes; easy if it takes 5 to 10 minutes, etc. Then scale values of 1, 2, … 5 are really rough estimates of the latent number of minutes it takes to solve the problem, so taking the average of the scale roughly maps to the average of the number of minutes it will take to solve a problem.
You probably noticed that the above two paragraphs are the most complex ones you have seen in the book so far. That is because it is difficult, though not impossible, to justify the very common practice of averaging ordinal data. In our experience no one ever attempts to identify or verify the assumptions necessary for taking an average of Likert scales, they just do it. And you will too.
2.4.3 Interval Data
Interval data goes one step further than ordinal data by adding meaning to the distances between the values in an ordinal scale. This allows for most numeric operations and certainly averages, standard deviations, and other statistics.
Two examples of interval scales are temperatures in Celsius and Fahrenheit. It makes sense to say, in either scale, that 20C is 5C warmer than 15C, and that the average of 20C and 30C is 25C. This can be seen in the following table:
| Ordinal | Celsius | Fahrenheit |
|---|---|---|
| Cold | 15 | 59 |
| Cool | 20 | 68 |
| Warm | 25 | 77 |
| Hot | 30 | 86 |
Each increase of 5 degrees Celsius is equivalent to an increase of 9 degrees Fahrenheit. One can convert in either direction and have the equivalency conserved, though obviously the scale is different.
Since both Celsius and Fahrenheit have negative temperatures, the 0 on the scale is not absolute. Consequently, it does not make sense to say that 25 C is 25% warmer than 20 C.
To see why, note that the change from cool to warm is a 25% increase in temperature in Celsius (5 degrees on a base of 20), but only about a 13% increase in Fahrenheit (9 degrees on a base of 68).
Clearly, something is wrong here, and what is wrong is that there is no absolute 0 for which to calculate ratios. It doesn’t really make sense to say that it is 25% hotter. This may sound pedantic, but it can create real problems in describing growth, which is often described in ratio terms. A statistical liar can use ratios to make numbers dance.
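The mismatch between the two scales is easy to verify in R:

```r
cool_c <- 20
warm_c <- 25
cool_f <- cool_c * 9 / 5 + 32   # 68
warm_f <- warm_c * 9 / 5 + 32   # 77

(warm_c - cool_c) / cool_c   # 0.25: a '25% increase' in Celsius
(warm_f - cool_f) / cool_f   # ~0.13: the ratio is not preserved
```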
2.4.4 Ratio Data
Ratio data has the additional feature of having an absolute 0. This means that ratios, along with all other standard mathematical operations, are now sensible.
To see this, you could compare salaries denominated in Canadian and U.S. dollars. Assuming an exchange rate of 1.30 CAD = 1.00 USD, we could have the following table of salaries.
| Canadian | US Equivalent |
|---|---|
| $100,000 | $76,923.08 |
| $110,000 | $84,615.38 |
| $130,000 | $100,000 |
The change from $100,000 to $110,000 in Canadian dollars is a 10% increase. Converting these two amounts to US dollars gives the US dollar equivalents of $76,923.08 to $84,615.38, which is also the same percentage increase. Ratio data contains the most ‘information’ and is often the most useful in a modelling context.
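In R, we can confirm that the percentage change survives the change of units, using the same salaries and assumed exchange rate as the table above:

```r
cad <- c(100000, 110000)   # salaries in CAD
usd <- cad / 1.30          # converted at the assumed 1.30 rate

diff(cad) / cad[1]   # 0.1: a 10% raise in CAD
diff(usd) / usd[1]   # 0.1: the same 10% raise in USD
```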
2.5 Time Dimensions of Data
In addition to the information content of data that arises as a function of its data type, there may be information in data that comes from its relationship to time. Here, we discuss three commonly occurring categories: cross-sectional, time-series, and panel data.
2.5.1 Cross Sectional Data
Cross sectional data is data where there is no meaningful time dimension associated with the observations. This can occur because all of the data is collected at the same point in time, or because time has no meaning to the data set. While that is the standard definition, I find it more helpful to think of cross sectional data as having no meaningful pattern associated with time amongst the observations in the sample.
For example, I could select 100 individuals and ask them how much money they earned in 2018. This data would be cross-sectional. It is clear that the data applies at a point in time, but there is no meaningful pattern of the data that is associated with time. It should not matter when I asked them what they earned in 2018 – the number will not change.
For cross-sectional data, any legitimate analysis can be performed on the data in any sequence. If you want to calculate the average of the data, you could sort them from smallest to largest and calculate the average or sort them from largest to smallest and calculate the average – in each case, you would get the same results.
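A tiny check of this order-invariance in R, with invented earnings:

```r
earnings <- c(52000, 61000, 47000, 90000, 58000)   # hypothetical 2018 earnings

mean(earnings)                            # 61600
mean(sort(earnings))                      # same, sorted ascending
mean(sort(earnings, decreasing = TRUE))   # same, sorted descending
```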
2.5.2 Time Series Data
Time series data has a meaningful time pattern associated with the data. This time pattern provides additional information beyond the content of the data itself which can be helpful in many analyses. For example, the daily closing price of a stock over the past year would be time series information. The sequencing of the data in time provides useful information. For example, if the stock were trending upwards, that might tell us something about the performance of the economy. That information would be lost if the data were analyzed in a different order.
Generally, any operation that can sensibly be performed on cross-sectional data can also be performed on time series data, and those analyses can be performed in any sequence to produce the same result (e.g., average, variance, min, max, etc.). But there are additional sensible operations on time series data that must be performed in temporal sequence (e.g., trend, forecasting, autocorrelation, etc.).
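A small simulated sketch (the price series is made up): order-free statistics survive shuffling, but the time pattern, here an upward trend, does not.

```r
set.seed(7)
t <- 1:250
prices <- 100 + 0.5 * t + rnorm(250, sd = 5)   # simulated trending prices
shuffled <- sample(prices)                     # same values, order destroyed

all.equal(mean(prices), mean(shuffled))   # TRUE: the average doesn't care
cor(t, prices)     # strongly positive: the trend lives in the sequence
cor(t, shuffled)   # near zero: the trend information is gone
```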
2.5.3 Panel Data
Panel data combines both cross-sectional and time-series aspects of data in a single data set. A typical panel data set contains observations on a sample taken at many points over time. For example, we could take a sample of 50 Starbucks™ coffee shops across Canada and determine their monthly sales for a five-year period of time.
Within any given time period, the data could be analyzed as cross-sectional, but the data also has time series properties: certain analyses, like trends at individual stores or in aggregate across all stores, can only sensibly be performed if the data is put in temporal sequence.
Panel data has a particularly interesting set of techniques because it allows one to consider how different characteristics of the cross-sectional observations vs. the time-series characteristics contribute to the overall characteristics of data. For example, one Starbucks store may have a great location and typically outperform others; one particular year may have abnormally high sales relative to others and ‘shock’ all the stores for that time period.
But, we are getting ahead of ourselves.
2.6 Final Thoughts (Optional Reading)
In our experience, business people often under-think issues of data collection in general, and sampling in particular, because they tend to have data in systems already and they don’t realize that they need to think through these issues. This can create at least two types of problems. Transactional systems, such as point of sale (POS) systems, may produce mountains of data, but the sample may not reflect the population of interest – actual customers may not represent future customers, data from last year may not reflect this year. Still, these may represent the best possible sources of data.
Beyond this, transactional systems may not contain the variables of interest – and might not be linkable to other systems that do contain the variables of interest. For example, your customer relationship management (CRM) system may have millions of data points on the time taken to resolve a customer service issue, but no field linking that data to the characteristics of the customer who raised it. This suggests that, when designing systems, one should engage in a lot of up-front thinking about sampling, types of data, and how that data might be used in the future so that opportunities for future analysis are not missed.
We generally think of this as being an issue of building data discipline – the basic recognition that data is an asset for the organization, even when its immediate use cannot be determined. I suspect that over the next 10 to 20 years, data discipline will evolve the way that quality systems have evolved in automotive companies since the emergence of Toyota Production Systems to the point where it is a taken-for-granted aspect of most organizations.
Even with a well-established data discipline, occasionally the need to generate data arises in a business context – I seem to get surveys all the time on how I enjoyed my hotel stay when I am in Toronto. As a manager and consumer of data, you need to understand the issues around sampling to avoid common mistakes (e.g., non-representative, too small, bias, etc.) that a badly constructed sample can bring.
Given that data exists, managers need to understand what constitutes an appropriate analysis and how to communicate the results. To this end, a familiarity with summary statistics and the ability to graph and chart these to tell effective stories are very important. A myriad of tools provide many ways to analyze data and create charts – my general advice is to think through what your story is and how you want to tell it before you get caught up in using any of these tools.
Chapter 3 Technical Details - Summary Statistics
This chapter will walk you through common summary statistics you are likely to encounter and use throughout your data analysis journeys. We will discuss what each statistic is, what it is used for, how to calculate it, and the relevant Excel and R functions to accomplish it in practice.
3.1 Measures of Central Tendency
3.1.1 Mean
The mean (or average) represents a central, or typical value for the dataset. To calculate the mean, add all of the observations for a single variable together and divide by the number of observations. \[\bar{x} = \frac{\displaystyle\sum_{i=1}^{n} x_i}{n}\] In Excel, the average can be calculated using the AVERAGE() function. Include all the observations available inside the brackets. See the attached Excel file for an example using the Motor Trend Cars dataset on the Summary Statistics worksheet.
In R, the average can be calculated using the mean function. See the example below in R. Note that we use the head command to review the first 6 rows of data, and then use the mean command to take the average across all observations of the mpg variable.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mean(mtcars$mpg)
## [1] 20.09062
3.1.2 Median
The median is another measure of the centre of the data, however instead of calculating an arithmetic mean, the median finds the value such that a randomly chosen observation is equally likely to fall above or below it. In other words, it is the middle number, separating the higher half of the data from the lower half of the data.
There is no single equation for the median. Instead, you order your data from smallest to largest and then select the middle value, found at position \(\frac{(n+1)}{2}\). In the case of an even-numbered dataset, you take the mean of the two most central values.
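To make the ordering procedure concrete, here is a small sketch in R (the function name manual_median is our own, for illustration only; in practice you would use the built-in median()):

```r
# Manual median: sort the data and take the value at position (n + 1) / 2.
# For an even n, average the two central values.
manual_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

manual_median(mtcars$mpg)  # matches median(mtcars$mpg): 19.2
```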
In skewed datasets, the median is often preferable to the mean. One large value can skew the mean significantly, but will not impact the median. In a normal distribution, the mean and median are the same.
In Excel, the median can be calculated using the MEDIAN() function. Include all the data points from a single variable inside the brackets. See the attached Excel file for an example.
In R, the median can be calculated using the median function. See the example below in R.
median(mtcars$mpg)
## [1] 19.2
3.1.3 Mode
The mode is our last measure of central tendency. It is the value that occurs most frequently in the dataset, or equivalently, the value that maximizes the probability mass function. The mode is the value most likely to be sampled from your distribution.
Again, there is no equation for calculating the mode. Instead, one must count the occurrences of every distinct value in the data and select the value with the highest count. It is possible for a dataset to have multiple modes (a dataset with two modes is called bimodal). A normal distribution will have the same mean, median, and mode.
In Excel, the mode can be calculated using the MODE() function. Include all the observations of a single variable inside the brackets. See the attached Excel file for an example.
In R, there is no default command to calculate the mode. Instead, one must install a secondary package or write their own code to calculate it. Here is an example of a user defined function that will calculate the mode. It works by first selecting the distinct values in a vector, then counting the number of times each value occurs, and then returning the value with the highest count.
# Note: this definition masks base R's mode(), which reports an object's storage mode
mode <- function(x) {
  key <- unique(x)                         # the distinct values in x
  key[which.max(tabulate(match(x, key)))]  # the value with the highest count
}
mode(mtcars$mpg)
## [1] 21
3.2 Measures of Dispersion
3.2.1 Variance
Variance measures how far a set of numbers are spread from their mean. It is often represented by the symbols \(\sigma^2\) for population variance, or \(s^2\) for sample variance.
The calculation for variance sums the squared distance between each point and the mean, and divides by the number of observations (for a population) or by one fewer (for a sample). \[\sigma^2 = \frac{\sum(x_i-\mu)^2}{N} \;\;or\;\; s^2 = \frac{\sum(x_i-\bar{x})^2}{n-1} \]
In Excel, the variance can be calculated with either the =VAR.S() or the =VAR.P() function, based on whether the data represents a sample or a population, respectively.
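The sample formula can be checked directly in R from first principles (a minimal sketch; the variable names are ours):

```r
x <- mtcars$mpg
n <- length(x)

# Sample variance from first principles:
# sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - mean(x))^2) / (n - 1)
s2                      # 36.3241
all.equal(s2, var(x))   # TRUE: matches R's built-in var()
```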
In R, the sample variance can be calculated as follows. The population variance must be calculated with custom code, which we do not cover here as it is less frequently used.
var(mtcars$mpg)
## [1] 36.3241
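If the population variance is ever needed, one workable sketch (the helper name pop_var is our own) simply rescales var(), since var() divides by n − 1 while the population formula divides by n:

```r
# var() returns the sample variance (divides by n - 1).
# Rescaling by (n - 1) / n gives the population variance (divides by n).
pop_var <- function(x) {
  n <- length(x)
  var(x) * (n - 1) / n
}

var(mtcars$mpg)      # sample variance: 36.3241
pop_var(mtcars$mpg)  # population variance, slightly smaller (about 35.19)
```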
3.2.2 Standard Deviation
Standard deviation is the square root of variance. It is often expressed as \(\sigma\) for the population or \(s\) for a sample. Standard deviation and variance tell a similar story, but the standard deviation is expressed in the same units as the data, rather than in squared units. Variance is more often used from a mathematical point of view, but the standard deviation is much easier to interpret.
To calculate standard deviation, we use the same equation as for variance, but take the square root of the result. \[\sigma = \sqrt\frac{\sum(x_i-\mu)^2}{N} \;\; or \;\; s = \sqrt\frac{\sum(x_i-\bar{x})^2}{n-1}\] In Excel, you can calculate the standard deviation using the =STDEV.S() function for samples or the =STDEV.P() function for populations. In R, the sample standard deviation can be calculated as follows. The population standard deviation must be calculated with custom code, which we do not cover here.
sd(mtcars$mpg)
## [1] 6.026948
3.2.3 Ranges, Maximums, Minimums
There are several convenient measures that can be used to better understand the distribution of any variable. The simplest include the maximum, the minimum, and the range. There is no special formula for any of these; they are easily calculated in Excel using the =MAX() and =MIN() functions. The range is simply the minimum subtracted from the maximum. The equivalent R functions are max() and min(), and R also provides a range() function that returns both at once.
max(mtcars$mpg)
## [1] 33.9
min(mtcars$mpg)
## [1] 10.4
range(mtcars$mpg)
## [1] 10.4 33.9
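One convenient idiom: since range() returns the pair (min, max) rather than a single number, wrapping it in diff() (which subtracts the first element from the second) yields the width of the range directly:

```r
# range() gives c(min, max); diff() subtracts min from max
diff(range(mtcars$mpg))  # 33.9 - 10.4, approximately 23.5
```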
There are more complex measures of the range of a variable. Most common among them are the quartiles and the interquartile range. Quartiles split the data into 25% increments based on magnitude. To calculate a quartile by hand, sort the data from lowest to highest, and then select the observation in the position roughly equal to the number of observations multiplied by the desired fraction (software packages interpolate between observations when that position is not a whole number). The interquartile range is simply the 1st quartile subtracted from the 3rd quartile. To calculate these in Excel, use the =QUARTILE() function and pass in all the observations of a single variable, a comma, and then the number of the quartile you are looking for (1, 2, or 3). Naturally, 0 or 4 would return the minimum and maximum respectively.
The same logic behind quartiles can be applied to create other groupings. In our experience, quartiles and deciles are the most frequently requested measures. Deciles are slightly harder to calculate in Excel; the =PERCENTILE() function will return any percentile, so the first decile is =PERCENTILE(range, 0.1). (The related =PERCENTRANK() function works in the other direction, returning the percentile rank of a given value.)
In R, the quartiles (and any quantile) can be easily calculated:
quantile(mtcars$mpg)
##     0%    25%    50%    75%   100%
## 10.400 15.425 19.200 22.800 33.900
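Two related base R helpers are worth knowing: IQR() computes the interquartile range directly, and quantile() accepts a probs argument for deciles (or any other percentile):

```r
# Interquartile range: 3rd quartile minus 1st quartile (22.800 - 15.425)
IQR(mtcars$mpg)  # 7.375

# Deciles: quantiles at 10% increments
quantile(mtcars$mpg, probs = seq(0, 1, by = 0.1))
```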
3.3 Measures of Relationships
3.3.1 Correlation
Correlation refers to the extent that two variables have a linear relationship. In other words, how much two variables tend to move together. For example, as a child grows taller they also grow heavier; in that way, height and weight are correlated variables.
Correlation is more easily understood by visualizing data – see Scatter Plots in Chapter 4. For now, we will stick to how to measure correlation mathematically. The most common measure of correlation is the correlation coefficient, denoted \(\rho\) for a population and \(r\) for a sample. It is bound between -1 (meaning a perfect negative correlation) and 1 (meaning a perfect positive correlation). A value of 0 suggests no linear correlation at all. To calculate the sample correlation of two variables, one can use the following equation:
\[ r_{xy}\quad=\quad {\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{(n-1)s_{x}s_{y}}}={\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sqrt {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\sum \limits _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}} \]
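As a sanity check of the formula, the coefficient can be assembled from sums of deviations in R and compared with the built-in cor() (a sketch; the variable names are ours):

```r
x <- mtcars$mpg
y <- mtcars$cyl

# Numerator: sum of products of deviations from the two means
num <- sum((x - mean(x)) * (y - mean(y)))
# Denominator: (n - 1) times the product of the sample standard deviations
den <- (length(x) - 1) * sd(x) * sd(y)

r <- num / den
r  # matches cor(mtcars$mpg, mtcars$cyl): about -0.852
```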
In practice, one rarely calculates the correlation coefficient by hand – it is easily done using statistical tools. In Excel, the =CORREL() function will calculate the correlation of any two arrays of data, which we show on the Summary Statistics worksheet.
In R, the correlation between two variables (here, we use mpg and cyl) can be calculated as follows:
cor(mtcars$mpg, mtcars$cyl)
## [1] -0.852162
It is worth noting one last thing – correlation does not imply causation. In the example above, we see a correlation of -0.85 between miles per gallon (mpg) and engine cylinders (cyl) – this intuitively makes sense: as engines get larger, they get worse mileage. But would you ever suggest that improving a car’s miles per gallon causes its engine to get smaller?
Chapter 4 Technical Details - Charts and Tables
This chapter will discuss various graphing techniques you can use to represent your data, depending on the story you would like to tell with it. We tend to group graphs into one of four categories based on the story they are able to present: comparisons, distributions, compositions, and relationships.
4.1 Comparisons
Comparisons always consider two or more variables and the differences between them. Good comparisons allow you to easily identify maximums and minimums and tie them back to categories. For comparisons between categories, we recommend you stick to bar charts. For comparisons over time, you can use line charts, bar charts, or radar charts.
4.1.1 Bar Charts
Bar charts are used to represent an aggregate measure corresponding to a categorical variable. Typically, one would include the categorical variable along the x-axis and the aggregate measure (e.g., sum, count) on the y-axis.
We will be using the ggplot2 package (part of the tidyverse) in R to produce the graphs. In this case, for the bar chart, we will use the geom_bar() geometry. See the Charts worksheet in the Excel file for how to create the same graph in Excel.
ggplot(mtcars, aes(x=cyl)) + geom_bar()
4.1.2 Line Charts
Line charts are usually used to represent the trend of one or more variables over a time component. Usually, your variable of interest is represented on the Y-axis, and the time component is represented on the X-Axis.
The default plot function in R does a good job of producing line charts with time series data, as seen below. See the Charts worksheet in the Excel file for a condensed version of the same chart with two stock indexes on the same graph for comparison.
plot(EuStockMarkets)
4.2 Distributions
Distribution speaks to how data is spread out or grouped, in much the same way you think about normal or binomial distributions. Distribution charts are usually relatively simple. For single variables, you should think about using histograms. For two variables, think about scatterplots or heat maps.
4.2.1 Histograms
Histograms are our go-to chart for understanding the distribution of an underlying variable. A histogram displays the frequency of data points within certain ranges, often called buckets. You can define your own buckets for a histogram depending on what you would like to show. Having too big a bucket may hide some of the underlying distribution.
ggplot(mtcars, aes(mpg)) + geom_histogram(binwidth=5)
In this example, we set the bin width to 5. The height of each bar represents the count of observations that fall within a certain range. For example, the first bar has a height of two, meaning that two observations have ‘mpg’ between 7.5 and 12.5. See the Charts worksheet in the Excel file for a similar histogram created using the Histogram option in the Data Analysis ToolPak. Note that if you want custom bin ranges, you must create them prior to using the Histogram function in Excel.
4.3 Compositions
Compositions show how individual parts make up a whole. For example, you might use a chart to visualize how each of your products contributes to a total revenue number. Unfortunately, most people lean on pie charts for this particular type of visualization – please don’t. Humans are notoriously bad at evaluating the relative sizes of circles and interpreting angles (if you do not believe us: which is longer, the circumference or the height of a can of Diet Coke?).
4.3.1 Waterfall Charts
A waterfall chart is an easy way to visualize a starting point, and then the changes from that point based on categories. It is often used to break down revenues, costs, and profits. You might imagine a chart with revenue from multiple business lines, the different costs associated with running your business, and the final bar visualizing total profit.
Creating a waterfall chart in R is well beyond the scope of the course, but is fairly easy to do in Excel. See the Charts worksheet in the Excel companion for details.
4.3.2 Area Charts
Area charts come in two flavours – stacked and unstacked. An unstacked area chart is very similar to a line chart, but is filled underneath. These are used primarily to represent cumulative totals over categories or time. A stacked area chart acts similarly to a waterfall chart and is used to show how multiple categories add up to a total, also over categories or time.
Again, area charts in R are out of scope for the course, but are easy to create in Excel. See the Charts worksheet in the Excel companion for details.
4.4 Relationships
Relationships aim to show the connection or correlation between variables. These charts aim to identify correlations, trends, patterns, and clusters. Typically, we use scatterplots for two variables and bubble charts for three variables. For example, you may want to understand how supply affects demand, or interest rates affect profit.
4.4.1 Scatter Plots
Scatter plots are the most common bivariate graphing technique. They show a single point for each observation across an X and a Y variable. Used most commonly to show correlation between variables, these charts also allow for the addition of a line of best fit, which is really your first glimpse at a simple linear regression! But, we are getting ahead of ourselves – more on regression later.
Naturally, more complex forms are available. 3D scatter plots allow you to add a third axis, while a third variable can also be added by colouring points, changing their shapes, or turning them into bubbles (as explained in the next section).
ggplot(mtcars, aes(x=hp,y=mpg)) + geom_point()
See the Charts (Part 2) worksheet in the Excel companion file for an example of the Scatterplot.
4.4.2 Bubble Charts
Bubble charts are very similar to scatter plots, but add a third variable represented by the size of the dot. The bubbles are scaled relative to the distribution of the third variable. Bubble charts are more difficult to understand than a classic scatter plot, so should be used sparingly. We have rarely seen bubble charts used effectively, but there are good use cases. A personal favourite is from The Economist and visualizes airport commutes with distance to city on X, population on Y, and size of airport as the bubble.
ggplot(mtcars, aes(x=hp,y=mpg, size=disp)) + geom_point()
See the Charts (Part 2) worksheet in the Excel companion file for an example of Bubble Charts.
4.5 Pivot Tables
Pivot tables are one of Excel’s most powerful features. In Excel, one can use pivot tables to easily transform, aggregate, and summarize an entire dataset. With the addition of filters and slicers, you can easily create dynamic analysis tools that will allow you to get multiple cuts of the same data in real time.
Pivot tables are best seen instead of explained. See the Excel companion worksheet “Pivot Tables” for a detailed look.
Chapter 5 Probability
This chapter deals with probability, which is to say, the mathematical modelling of random or uncertain events. An understanding of probability provides an important foundation for many of the tools used in analytics and artificial intelligence. Beyond understanding, probability helps one avoid common fallacies in probabilistic thinking – and there are many of them.
We will start our discussion of probability with some simple ideas about how probabilities behave and how they are derived from real world events. From there we will move on to a discussion of sets and set theory. Set theory will give us a way to develop an intuitive understanding of probability as area. It will also give us the language to develop the basic rules of probability, which we will develop as far as Bayes’ Theorem.
We will then focus on three methods of solving probability problems: Tables, Trees and Formulas. We will demonstrate the solution to several problems using these techniques. Finally, the chapter will wrap up with some optional managerially-focused discussion on probability and its relation to business.
A final word before we begin. Few students ever really master probability and pretty much everyone who has ever studied probability finds it difficult and frustrating along the way. If you find it difficult and frustrating, take heart: that fact says very little about you or the likelihood that you will eventually master the topic.
5.1 From Reality to Probability
When talking about probabilities, we are really talking about the probability that something happens within some set of possible outcomes. We refer to things that can happen or not as events and the collection of all events as the sample space. The process of selecting an event from all the events in the sample space is called an experiment.
Experience suggests that simple examples are the best way to explain these concepts. So we might consider the experiment of rolling a single standard die. The possible events are rolling a 1, 2, 3, 4, 5, or 6 - these events are mutually exclusive (i.e., they cannot happen at the same time) and collectively exhaustive (i.e., no other options are possible) (MECE). They are also special in that they are what are called elementary events – they cannot be decomposed into smaller events.
The elementary events can be combined together to form more complex events. For example, the event ‘roll an odd number’ corresponds to the elementary events roll 1, 3, or 5. The event ‘roll an even number’ corresponds to the events roll 2, 4, or 6. Rolling Even and Rolling Odd are mutually exclusive but they are not elementary events because they can be decomposed into other events in the sample space.
More interesting experiments might involve attempting to get a job offer. Suppose a company puts each candidate through three interviews and if two or more interviewers are impressed, a job is offered. The experiment might be called ‘apply for a job’, the sample space might be impress 0, 1, 2, or 3 interviewers. The event ‘getting an offer’ is equivalent to either of the elementary events ‘impress 2 interviewers’ or ‘impress 3 interviewers’ occurring.
Given that we have events and some ideas about how to organize them, we have to ask the question, “How does probability enter into the story?” It turns out there are three traditional ways to think about probability: classical, relative frequency, and subjective.
5.1.1 Classical Probability
Classical probability involves defining the probability associated with elementary events and then calculating the probability of more complex events by adding up the probability of the elementary events that define the more complex events. Since elementary events are MECE and represent events that cannot be subdivided, every event is composed of a collection of elementary events.
The elementary events in a sample space do not have to have equal probability, but the classical approach assumes that they do. This makes it an effective method for studying gambling outcomes and simple textbook problems such as rolling dice or flipping coins.
For example, I could take a six-sided die and label the sides A, B, C, D, E, and F. If I rolled the die, I would expect each outcome to be equally likely, so each possible outcome would have a probability of \(1/6\). For example, the probability of rolling B would be \(1/6\).
The drawback for classical probability is that the individual probabilities have to be equal for elementary events. This makes sense for experiments like rolling dice, but cannot be justified with experiments like impressing a prospective hiring manager.
5.1.2 Relative Frequency
When elementary events cannot be treated as having equal probability, one cannot rely on the classical approach to establish the probability. In this case, it may be possible to look at how frequently an event occurs given the number of times it could have occurred under similar circumstances, and use this to determine the probability.
For example, I could set up a multiple choice question with six possible outcomes, only one of which was correct. I could then think of the experiment of a single student answering that multiple choice question. There would be six possible answers, \(\{A, B, C, D, E, F\}\) – but it would not be reasonable to think that each answer was equally likely. Suppose the correct answer was B; hopefully B would be more probable than other options.
To determine the probability that B would be selected by a random student, I could look at historic data on how students answered this question. Since my students are awesome, they chose B exactly 99 of the past 108 times the question arose. Based on this information, we might conclude that the probability of a student selecting the correct answer is \(P(Correct \;answer\; selected) = P(B) = 99 / 108 = 11 / 12 \approx 0.9166\). Using historic data, we could calculate similar probabilities for the other answers (i.e., elementary events) and use those to calculate the probability of compound events. One thing should be clear upon reflection: the remaining options would share the probability \(9 / 108\) between them.
The relative frequency approach highlights one interpretation of what a probability is, namely the long-run average rate at which an event will occur in a particular experiment. This interpretation is perfectly valid in situations where it applies, but as we will show below, it sometimes seems a bit hollow.
Relative frequency resolves one of the challenges associated with the classical method by removing strong assumptions about the probability of elementary events. Unfortunately, one does need to make assumptions about what constitutes a similar situation, and care must be taken to ensure the situations being considered are genuinely similar and occur in sufficient number to get an accurate reflection of the probability. One thing to be careful of is that probabilities calculated using the frequency approach can suggest an accuracy that is not warranted.
5.1.3 Subjective Probability
Clearly, there are situations where either the Classical Method or Relative Frequency are perfectly appropriate. In other situations, the events are so rare, or possibly unique, that frequency loses any practical meaning, and elementary events cannot be assumed to have equal probability either. Suppose there was an election, such as the US primaries, where the contest starts with six candidates, \(\{A, B, …, F\}\).
To establish the probability that candidate B would win, I’d have to make an educated guess. This is called a subjective assessment. There is nothing wrong in principle with basing probability on an educated guess, but evidence suggests that people are very bad at doing this. Worse, when multiple probabilities are assessed subjectively, someone may construct estimates that are mutually inconsistent.
For example, I might think that my favorite candidate, candidate B, has a 60% chance of winning and that her arch rival candidate C has a 30% chance. I may think that if B does not win, then D is the next best candidate and that if B loses, D has a 40% chance of winning. It may not be obvious, but these numbers are not consistent: they imply a \(0.4 \times 0.4 = 0.16\) probability that D wins, and \(0.6 + 0.3 + 0.16\) already exceeds 1. This suggests that care must be taken when using subjective assessments of probability.
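The arithmetic behind the problem is easy to verify. If P(B wins) = 0.6, then P(B loses) = 0.4, and "D has a 40% chance if B loses" implies P(D wins) = 0.4 × 0.4 = 0.16. Adding in P(C wins) = 0.3:

```r
p_B <- 0.6              # subjective: B wins
p_C <- 0.3              # subjective: C wins
p_D <- (1 - p_B) * 0.4  # D wins only if B loses: 0.4 * 0.4 = 0.16

# These three candidates alone already exceed a total probability of 1
p_B + p_C + p_D  # 1.06 -- impossible, so the assessments are inconsistent
```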
Regardless of how probabilities are established, the same rules of probability must be followed. The first is that probabilities are strictly bounded between 0 and 1, which is to say that every event in the sample space must have a probability between 0 and 1 inclusive. The second is that something must happen, which says that the sum of probabilities over all elementary events must add to one.
We will encounter other rules shortly, but to do so we must introduce the language of probability, which is based on the mathematical language used to describe sets and mathematical operations that are applied to sets. We have already encountered a few words of this language such as experiments, and events. We will now introduce some more, but don’t worry, we will keep it as simple and non-technical as possible.
5.2 Set Notation and Probability Rules
A set is a collection of things. For our purposes they will generally be things that could happen or not in a particular situation. That collection could be finite or infinite in size as long as we can clearly determine whether any particular element is in the set or not. We will generally denote sets with letters and either list all the members of the set or provide a formula that defines membership.
For example, we could define a set of popular pets as: \[A = \{Great\; Dane, Miniature\; Schnauzer, Australian\; Sheep\; Dog, Cat, Goldfish\}\] A set of small dogs as: \[B = \{Pug, Toy\; Poodle, Miniature\; Schnauzer\}\] the set of possible outcomes on a standard dice roll as: \[C = \{1, 2, 3, 4, 5, 6\}\] or the set of the number of times you could flip heads in a row on a coin as: \[D = \{0, 1, 2, 3, 4, …\}\]
Whatever the elements of a set are, we can refer to them by their position in the set as indexed values. The notation here varies, but we will generally use variables with subscripts. For example, we could refer to the set of popular pets as containing elements \(x_1,\;x_2,\;...\;x_5\). In this case, \(x_3 = Australian\; Sheep \; Dog\).
This kind of notation allows us to make general comments about sets without getting bogged down by the details. This makes it easy to specify general rules. If we name each of the n elementary events in a sample space as \(x_1,\; x_2, \; ... \; x_n\) we can restate our two probability rules more formally as: \[ 0 \leq p(x_i) \leq 1 \] where \(p(x_i)\) is the probability of any event \(x_i\). This is simply ‘math speak’ for each ‘event has a probability between 0 and 1 inclusive’.
The second rule, that the sum of the probability over all n elementary events adds up to 1, can be stated more formally in math-speak as: \[\sum\limits_{i=1}^nP(x_i) = 1 \] If you are unfamiliar with the notation, don’t worry. \(P(You\;get\;the\;hang\;of\;it\;before\;the\;end) > .98\).
5.2.1 Set Operations
Like numbers, sets have mathematical operations. In this section, we will define a few basic ones which we will use to build up more complex ones. The most important for us are: intersection, union, sample space and complement.
5.2.2 Intersection
We will denote the intersection of sets by a \(\cap\) or the word AND. The intersection of two sets indicates all objects that are found in both sets. So the intersection of \[A = \{Great\; Dane, Miniature\; Schnauzer, Australian\; Sheep\; Dog, Cat, Goldfish\}\] and \[B = \{Pug, Toy\; Poodle, Miniature\; Schnauzer\}\] would be: \[A\; AND\; B = A \cap B\] \[= \{Great\; Dane, Miniature\; Schnauzer, Australian\; Sheep\; Dog, Cat, Goldfish\} \cap\] \[\{Pug, Toy\; Poodle, Miniature\; Schnauzer\}\] \[= \{Miniature\; Schnauzer\}\] So in English, intersection might be best understood as and in the sense that the intersection of two sets is those objects in the first set and in the second set.
5.2.3 Union
The union of two sets is indicated by \(\cup\) or the word OR. The union of sets indicates all objects that are found in any of the sets. So the union of E and F: \[E = \{Even\;Dice\;Rolls\} = \{2,\;4,\;6\}\] \[F = \{Below\;Average\;Dice\;Rolls\} = \{1,\;2,\;3\}\] \[E \cup F\] \[= \{2,\;4,\;6\}\: \cup \: \{1,\;2,\;3\}\] \[= \{1,\;2,\;3,\;4,\;6\}\]
So in English, the union is best understood as or in the sense that the union is the collection of objects that is in one set or in the other set.
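Base R ships with set operations that mirror this notation directly (intersect() for \(\cap\), union() for \(\cup\)), which makes it easy to experiment with small examples like the dice sets above:

```r
E <- c(2, 4, 6)  # even dice rolls
F <- c(1, 2, 3)  # below-average dice rolls (note: this masks R's shorthand
                 # F for FALSE, which is fine in a throwaway example)

intersect(E, F)  # E AND F: 2
union(E, F)      # E OR F: the elements 1, 2, 3, 4, 6 (order may vary)
```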
5.2.4 Sample Space
We have encountered the sample space already. Now we can be a bit more formal in defining it. The sample space, often denoted \(\Omega\), indicates all the elementary events in an experiment.
So for a standard dice roll, the sample space would be denoted: \(\Omega = \{1,\;2,\;3,\;4,\;5,\;6\}\).
With this notation, we can formally define a collectively exhaustive set of sets as being ones such that the union of those sets is equivalent to the sample space. More formally, if \(A_1,\; A_2,\;...,\;A_k\) are sets, then the sets \(A_1,\;...\;,A_k\) are collectively exhaustive if and only if
\[A_1 \cup A_2 \cup ... \cup A_k = \Omega\]
5.2.5 Null Space
The null space is the empty set which we will denote with the word ‘Null’. The null set is not the set with the value 0, the null set is empty – it contains no elements. From our ‘something must happen rule’, it is immediately apparent that \(P(Null) = 0\).
The intersection operation and the idea of the null space, allow us to formally define mutually exclusive sets. Sets are mutually exclusive if none of them contain elements found in another set. More formally, two sets, A and B, are mutually exclusive iff \(A \cap B = Null\).
5.2.6 Complement
The complement of a set is the set of all of those things that are in the sample space but are not part of the set in question. So, the complement of the set contains only those elements that would have to be added to the set in order to complete the sample space.
In the dice roll experiment, the complement of the even dice rolls is the odd dice rolls. To see this, note that if \(E = Even\; Dice\; Rolls = \{2,\;4,\;6\}\) and \(O = Odd\;Dice\;Rolls = \{1,\;3,\;5\}\) then: \[E \cup O = \{2,\;4,\;6\} \cup \{1,\;3,\;5\} = \Omega = \{1,\;2,\;3,\;4,\;5,\;6\}\] and \[E \cap O = Null\] Any event together with its complement collectively exhausts the sample space. Based on the ‘something must happen rule’ that means: \[P(Event \cup Complement) = 1\] Since any event and its complement are by construction mutually exclusive, we can decompose this into: \[P(Event) + P(Complement) = 1\] This can be rearranged to form the complementary rule: \[P(Event) = 1 - P(Complement)\] Complements occur so frequently in probability that we often denote the complementary set with special notation, \(\sim A\), read as ‘not A’ (the logical operator NOT). The rule can then be specified as: \[P(A) = 1-P(\sim A)\] You will find that in determining the probability of complex events, it is often easier to calculate the probability of the complement and then use the complementary rule.
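These identities can be checked mechanically in R with setdiff(), which returns the elements of one set not found in another – a reasonable stand-in for the complement relative to the sample space:

```r
omega <- 1:6                 # sample space for one die roll
E     <- c(2, 4, 6)          # even rolls
not_E <- setdiff(omega, E)   # the complement: everything in omega not in E

not_E                        # 1 3 5, the odd rolls
# The complementary rule: P(E) = 1 - P(not E)
length(E) / length(omega)          # 0.5
1 - length(not_E) / length(omega)  # also 0.5
```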
With the language of sets in place, we can define some important concepts for probability. These are probability statements rather than simply set theory statements. We will start with three key probability ideas: marginal, joint, and conditional, and then build on those.
5.2.7 Marginal Probability
The marginal probability is the probability that an event will occur given that we know nothing else except that the experiment has been run. So if we flip two coins, the sample space would be \(\{TT,\;TH,\;HT,\;HH\}\). We could identify \(B = \;'Both\; are\; Heads' = \{HH\}\) as an event of interest for which we could calculate the probability. We would indicate this probability as: \[P(Both\; Are\; Heads) = P(B)\] In this case, the classical method could be used since the elementary events are equally probable. The probability would be \(1/4\) since only one of the four possible outcomes belongs to the event of interest.
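One way to make this concrete is to enumerate the sample space in R with expand.grid() and count the outcomes of interest (a sketch; the column names are our own):

```r
# Enumerate the sample space for two coin flips: four equally likely outcomes
omega <- expand.grid(first = c("H", "T"), second = c("H", "T"))
omega

# Marginal probability that both coins are heads: favourable / total
both_heads <- omega$first == "H" & omega$second == "H"
sum(both_heads) / nrow(omega)  # 0.25
```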
5.2.8 Joint Probabilities
These are the probabilities that two events both happen. So in the two coin toss experiment we have been discussing, we could consider the event \(H1 = The\; First\; Coin\; is\; Heads\) and \(H2 = The\; Second\; Coin\; is\; Heads\)
From your knowledge of set theory, it should be obvious that the intersection of the events ‘The First Coin is Heads’ and ‘The Second Coin is Heads’ is the event ‘Both Coins are Heads’. In more formal notation, we can say: \[B = H1 \cap H2\] Which means that: \[P(B) = P(H1 \cap H2)\]
It may seem natural to calculate the probability that at least one of two events occurs by simply adding up their individual probabilities, such as \(P(H1)\) and \(P(H2)\). When the events involved are mutually exclusive, adding their probabilities does give the probability that one of them occurs, but in general that approach does not work.
To see why this doesn’t always work, consider that in general, events are made up of elementary events. For example, in the standard die roll experiment, the elementary events are \(\{Roll\; a\; 1, Roll\; a\; 2,\; …, \;Roll\; a\; 6\}\). More complex events are made by grouping together elementary events. Roll an Even Number is made up of the union of three elementary events. More formally: \[Roll\; An\; Even\; Number = Roll\; a\; 2 \cup Roll\; a\; 4 \cup Roll\; a\; 6 = \{2,\;4,\;6\}\] Similarly \[Roll\; Higher\; Than\; Four = Roll\; a\; 5 \cup Roll\; a \; 6 = \{5,\; 6\}\] From these definitions, we can determine that \(P(Roll\; an\; Even\; Number) = 3/6\) and \(P(Roll\; Higher\; Than\; Four) = 2/6\). But it should be obvious that: \[P(Roll\; an\; Even\; Number \cup Roll\; Higher\; Than\; Four) \neq \] \[P(Roll\; An\; Even\; Number) + P(Roll\; Higher\; Than\; Four)\] Strictly speaking, the reason why is that the elementary event ‘Roll a 6’ belongs to both ‘Roll an Even Number’ and ‘Roll Higher Than Four’, so the probability associated with rolling a six is being double counted. To correct this, we need to adjust our probability addition formula to be: \[P(A \cup B) = P(A) + P(B) - P(A \cap B)\] Where the last term adjusts for the probability associated with those elementary events found in both sets A and B. Doing so we would find: \[P(Roll\; an\; Even\; Number \cup Roll\; Higher\; Than\; Four) = \] \[P(Roll\; an\; Even\; Number) + P(Roll\; Higher\; Than\; Four) - \] \[P(Roll\; an\; Even\; Number \cap Roll\; Higher\; Than\; Four)\]
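The addition rule can be verified by brute-force counting; here is a short supplementary Python sketch (the set names are ours):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}   # roll an even number
high = {5, 6}      # roll higher than four

p = lambda event: Fraction(len(event), len(omega))

# Counting the union directly agrees with the addition rule
direct = p(even | high)
by_rule = p(even) + p(high) - p(even & high)
assert direct == by_rule == Fraction(4, 6)

# Naive addition double counts the shared outcome 'roll a 6'
assert p(even) + p(high) != direct
```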
Naturally, if events A and B are mutually exclusive, then \(P(A \cap B) = 0\), so no adjustment needs to be made.
This example illustrates the additive law of probability. It also, incidentally, illustrates why it is a lot more convenient to use letters rather than full words to describe events.
5.2.9 Conditional Probability
Where marginal probabilities were based only on knowing that the experiment took place, conditional probabilities are based on having additional information. For example, in the coin tossing experiment, if you knew that the first coin came up heads, you would have more information to use to estimate the probability that both coins were heads. We write conditional probabilities as \(P(Event\; of\; Interest\; | Events\; That\; Have\; Occurred)\), so we would write out the conditional probability of getting two heads given that the first coin was a head as: \[P(Two\; Heads | First\; Coin\; is\; Heads)\] For a more concise representation, we would define the events as \[B = Both\; Heads,\; H1 = First\; Coin\; is\; Heads\]
and write the probability as: \(P(B|H1)\)
Conditional probability statements always follow the same format. The first event in the parentheses is the event whose probability you want to know. The second is the event that has occurred, or whose occurrence you want to condition on.
Conditional probabilities can be calculated using the multiplication rules for probabilities: \[P(A \cap B) = P(A|B)P(B)\] Which can be rearranged to get: \[P(A|B) = \frac{P(A \cap B)}{P(B)}\] This rule is difficult to prove but easy to demonstrate. Suppose we wanted to calculate the probability of getting a 1 on a standard die roll given that we know we rolled an odd number. The sample space is \(\Omega = \{1,\;2,\;3,\;4,\;5,\;6\}\).
Let \(A = Rolled\; a\; One =\{1\}\) and \(B = Rolled\; an\; Odd\; Number = \{1,\;3,\;5\}\).
We can calculate the intersection: \[Rolled\; an\; Odd\; Number \cap Rolled\; a\; One = B \cap A = \{1,\;3,\;5\} \cap \{1\} = \{1\} = A\] So we know \(P(A \cap B) = P(A) = 1/6\), and since \(B = Rolled\; an\; Odd\; Number\), \(P(B) = 3/6\). So: \[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}\] That was a very easy example. Now you try one that is a bit harder.
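We can confirm this calculation in code; a minimal Python sketch (supplementary to the Excel/R material in this text):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {1}          # rolled a one
B = {1, 3, 5}    # rolled an odd number

p = lambda event: Fraction(len(event), len(omega))

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p(A & B) / p(B)
assert p_a_given_b == Fraction(1, 3)
```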
5.2.9.1 What is the probability of rolling an odd number given that you rolled a number that was below average?
Hint 1: What is a below average dice roll?
Hint 2: What are my events?
Hint 3: What are the probabilities?
Hint 4: How do I plug this into the formula?

Naturally, these things get much more complex. But you are off to a good start!
5.2.10 Independent Events
Independent events are a very important special case in probability. Two events are independent iff the occurrence of one of the events does not change the probability that the other event will occur. Based on the notation we have developed so far, we can say that two events A and B are independent iff: \[P(A|B) = P(A)\] It turns out that if \(P(A|B) = P(A)\), you can use the other probability rules to show that \(P(B|A) = P(B)\), so if A is independent of B, then B is independent of A.
Given this result, you can show that if A and B are independent then: \[P(A \cap B) = P(A)P(B)\] How, you ask? Let’s start with: \[P(A|B) = \frac{P(A \cap B)}{P(B)}\] Which we can rearrange to say: \[P(A \cap B) = P(A|B)P(B)\] Since there is nothing special about events A and B, we can swap them: \[P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)}\] \[P(A \cap B) = P(B|A)P(A)\] We can combine these to get: \[P(A \cap B) = P(B|A)P(A) = P(A|B)P(B)\]
Independence implies that \(P(A|B) = P(A)\) so substituting this in we get: \[P(B|A)P(A) = P(A|B)P(B) = P(A)P(B)\] Divide the whole thing by \(P(A)\) and: \[P(B|A) = P(B)\]
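A quick way to convince yourself of these identities is to enumerate the two-coin sample space; a supplementary Python sketch (variable names are ours):

```python
from fractions import Fraction
from itertools import product

# Sample space of two coin flips: ('H','H'), ('H','T'), ('T','H'), ('T','T')
omega = set(product("HT", repeat=2))
H1 = {o for o in omega if o[0] == "H"}  # first coin is heads
H2 = {o for o in omega if o[1] == "H"}  # second coin is heads

p = lambda event: Fraction(len(event), len(omega))

# For independent events, the joint probability is the product of marginals
assert p(H1 & H2) == p(H1) * p(H2) == Fraction(1, 4)

# Equivalently, conditioning on H2 does not change the probability of H1
assert p(H1 & H2) / p(H2) == p(H1)
```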
5.2.11 Mutually Exclusive Events
We introduced the idea of mutually exclusive events above, but experience suggests that it is worth revisiting the concept in light of independent events. For some reason, many people get confused about the difference between mutually exclusive and independent events.
Mutually exclusive events are ones where the occurrence of one precludes the occurrence of the other. In other words, if A and B are mutually exclusive, then \(P(A|B) = 0\) and \(P(B|A) = 0\).
Independent events are ones where the occurrence of one tells you nothing about the probability of the occurrence of the other, so \(P(A|B) = P(A)\) and \(P(B|A) = P(B)\).
Clearly these are not the same thing. Make sure you intuitively understand these ideas. Here are a few quick questions to make sure you get it:
5.2.11.1 You flip two distinct coins: the events “Getting Heads on the First Flip” and “Getting Heads on the Second Flip” are: independent, mutually exclusive, or neither?
5.2.11.2 You roll a single die: the events “Getting an Odd Number” and “Getting a Below Average Number” are independent, mutually exclusive, or neither?
5.2.12 Rules Summary
You probably noticed – we snuck in a bunch of probability rules while explaining the concepts. It seems only fair to provide a summary of them in one place for your reference. These rules each have a general and a specific form; the specific form applies to events that are either independent or mutually exclusive. We do not find memorizing these additional rules to be helpful, but have added them for completeness.
5.2.12.1 Addition Rule
The addition rule is used when you want to calculate the probability that at least one of two events occurs.
The general case: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
And the special case, when A & B are mutually exclusive: \(P(A \cup B) = P(A) + P(B)\)
5.2.12.2 Multiplication Rule
The multiplication rule is used to calculate the probability that two events both happen.
The general case: \(P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\)
And the special case, when A and B are independent: \(P(A \cap B) = P(A)P(B)\)
5.2.12.3 Complementary Rule
The complementary rule is used when you know or can easily calculate the probability that the event of interest will not happen.
There is only one case: \(P(A) = 1-P(\sim A)\)
5.2.12.4 One more
This rule, sometimes called the law of total probability, does not always get ‘law’ status, but is still convenient. \[P(B) = P(B|A)P(A) + P(B|\sim A)P(\sim A)\]
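A quick numeric check of this rule, using hypothetical probabilities chosen purely for illustration:

```python
# Hypothetical inputs (not from any problem in this chapter)
p_a = 0.6              # P(A)
p_b_given_a = 0.3      # P(B|A)
p_b_given_not_a = 0.8  # P(B|~A)

# P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
print(round(p_b, 2))  # 0.5
```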
5.2.13 Bayes’ Theorem
There is one final probability rule to cover before we solve problems. This is the celebrated (or dreaded!) Bayes’ Theorem. Bayes’ Theorem is used on problems involving conditional probability to allow one to swap the ‘direction’ of the conditioning probability. It is often expressed in two forms, both of which are indicated in the equation below. \[P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\sim A)P(\sim A)}\]
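Bayes’ Theorem is easy to check numerically. The following Python sketch uses hypothetical numbers (a rare condition and an imperfect test, invented for illustration) to compute both forms of the formula:

```python
# Hypothetical scenario: a condition with a 2% base rate, a test that is
# positive 95% of the time when the condition is present (P(B|A)) and
# 10% of the time when it is absent (P(B|~A)).
p_a = 0.02
p_b_given_a = 0.95
p_b_given_not_a = 0.10

# Denominator expanded via the total probability rule
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.162
```

Notice how the ‘direction’ of the conditioning is swapped: we start from \(P(B|A)\) and end with \(P(A|B)\), which here is far smaller than the test’s 95% accuracy might suggest.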
5.2.14 Optional Material - Bayesian Statistics
5.3 Solving Probability Problems
There are three commonly used methods for solving probability problems: tables, trees, and formulas. In a moment we will dig into each of the three. Tables are easy to use, but limited in scope. Most students prefer to start with trees for solving problems, but most advanced users prefer formulas. We recommend that you start with trees if possible to get a sense of how probabilities work. You may want to switch to formulas as you get more comfortable with trees.
5.3.1 Tables
Tables are occasionally useful when you have a lot of data that can be organized into a fairly small number of categories. Basically, you create a table that maps on to those categories, count the number of observations that correspond to the event of interest and use basic math to calculate conditional and marginal probabilities associated with various events.
‘Real world’ table problems tend to be based on a lot of data and result in tables produced by reporting software such as a Pivot Table in Excel, Python, or R. Textbook problems tend to look more like Sudoku problems – a table is provided where some values are missing, the missing values are inferred from the rest of the table and then probabilities are calculated.
We think you are capable of completing a table or processing data with a pivot table, so we will assume you can do so (or do what we do – assign it to a minion!).
5.3.1.1 Sample Problem - Hotel Survey
The hotel you work for collected data on 100 randomly sampled guests. While many pieces of data were collected, responses to two questions are summarized in the following table. The questions were ‘Did you feel the room was large enough?’ and ‘Do you anticipate returning to the hotel within a year?’
| Room Size Response | Plan_To_Return | Do_Not_Plan_To_Return | Total |
|---|---|---|---|
| Room Large Enough | 6 | 19 | 25 |
| Room Not Large Enough | 26 | 49 | 75 |
| Total | 32 | 68 | 100 |
Use the information in the table to calculate the following probabilities:
How likely is it that a customer plans to return to the hotel?
How likely is it that a customer who plans to return to the hotel finds the room size too small?
Should we be concerned that customers are not returning because the rooms are too small?
Solutions
Here we are being asked for the marginal probability that a guest plans to return to the hotel. We might write this as \(P(Return)\). Based on relative frequency, we can calculate this probability as: \[P(Return) = \frac{Number\; of\; Guests\; Who\; Plan\; to\; Return}{Number\; of\; Guests\;} = \frac{32}{100} = 0.32\]
This question is asking for the conditional probability that a guest who plans to return found that the room was too small. We might write \(P(Small | Return)\). To calculate this, we need to consider only those who plan to return, so the probability based on relative frequency is: \[P(Small | Return) = \frac{26}{32} = 0.8125\]
The last question is a business question – a trivial one perhaps, but it will illustrate the thinking. These types of questions often have multiple possible answers and normally one has to provide an explanation of what one found and why.
Here one might think that customers are not returning because they think the room is too small – so you might want to consider whether customers are less likely to return if they found the room too small. This probably won’t solve your problem, but it may give you some insight. So we want to calculate two probabilities: \(P(Return|Small)\) and \(P(Return|Large)\). Using similar logic to that used above, \(P(Return|Small) = \frac{26}{75} \approx 0.347\) and \(P(Return|Large) = \frac{6}{25} = 0.24\). This result appears to suggest that the probability of returning is not lower for those who believe the room is too small.
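The table calculations above can also be reproduced in code. This is a supplementary Python sketch (the book's worked examples use Excel and R); the event labels are our own:

```python
from fractions import Fraction

# Counts from the hotel survey table; 'small' = room not large enough
counts = {
    ("small", "return"): 26, ("small", "stay_away"): 49,
    ("large", "return"): 6,  ("large", "stay_away"): 19,
}
total = sum(counts.values())  # 100 guests

def p(pred):
    """Relative-frequency probability of the cells matching pred."""
    return Fraction(sum(n for k, n in counts.items() if pred(k)), total)

# Marginal: P(Return)
p_return = p(lambda k: k[1] == "return")
assert p_return == Fraction(32, 100)

# Conditional: P(Small | Return) and the two return probabilities
p_small_given_return = p(lambda k: k == ("small", "return")) / p_return
p_return_given_small = p(lambda k: k == ("small", "return")) / p(lambda k: k[0] == "small")
p_return_given_large = p(lambda k: k == ("large", "return")) / p(lambda k: k[0] == "large")
assert p_small_given_return == Fraction(26, 32)
assert p_return_given_small == Fraction(26, 75)
assert p_return_given_large == Fraction(6, 25)
```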
The result may seem odd to you, but there are any number of justifications as to why this might be the case. Perhaps the location of the hotel or the price is a much more important factor in determining whether someone will stay or not. Perhaps frequent travelers are more critical of room sizes but are nevertheless more likely to return to the hotel.
This should highlight two important issues. The first is that a problem is seldom solved with the first question – asking questions is far more likely to lead to more questions, and a deeper understanding along the way, than it is to provide an immediate solution. The second is that explaining behavior (intention to return) in terms of a single variable (room size) is unlikely to be successful. Fortunately, there are more advanced models that could effectively determine \(P(Return | Room\; Size,\; Price,\; Family\; Size,\; Business\; /\; Personal\; Traveler,\; etc.)\) We sincerely hope that you will stick around for later courses that cover this material.
One final thing to be aware of with table-based probabilities. While they are presented as probabilities, they are really just estimates and are subject to all the sampling and sample size problems that were discussed in the chapter on samples.
5.3.2 Trees
Trees help solve probability problems by providing a structure which reminds you what to look for and reduces the need to use formulas from memory. Early on in your study of probability, they help you develop some intuition of how probabilities and events are decomposed.
Unfortunately, trees require more overhead and become unwieldy for complex problems. To the degree that they are substitutes for formulas, they also detract from mastering material that would be helpful in future studies. So we recommend starting with trees but moving to formulas as soon as you feel ready.
Trees are used by decomposing a problem into a collection of events. The tree structure is built out starting with the event that is used to condition other events. For example, if you wanted to know the probability of a customer purchasing your product given that she has previously purchased the product, the event of interest would be purchasing and the conditioning would be that she already purchased a product from you.
The tree structure itself looks like this:

Figure 5.1: Probability Tree
The logic works by splitting the tree such that every node produces an event and its complement. This means that the branches add up to one in specific ways, as shown in the diagram. You start with the conditioning event and write out the other events moving from left to right. One simply plugs in known values and uses the logic of the tree to fill in unknown values.
At the rightmost part of the tree is a collection of mutually exclusive and collectively exhaustive (MECE) events whose probabilities can be added to determine the probabilities of compound events.
5.3.2.1 Sample Problem with Trees
As an independent consultant, I have the opportunity to bid on two projects with the same company. One project is to develop an advanced forecasting system; the other is to develop a basic forecasting system. Selling both systems would be great, since both generate revenue; however, the advanced system is the one I really want to sell, because it pays much better. Since the basic system is cheap, customers are about 80% likely to take it, but then the probability of taking the advanced system is only 30%. On the other hand, if they refuse the basic system, they will almost certainly take the advanced system – I’d guess it is about 95% likely.
- What is the probability I will sell the basic system?
- What is the probability I will sell the advanced system?
- What is the probability I will sell both systems?
To solve this problem, we are going to use the following 5 steps:
- Read and understand the problem and what is being requested.
- Identify the events of interest in the scenario or problem.
- Define events in formal notation.
- Note the relevant probabilities that are immediately available from the data.
- When using trees:
- Set up the structure so that you can identify the events of interest.
- Write the immediately available probabilities into the tree structure.
- Use the tree structure to fill in the rest.
- Pick out the probabilities that solve your problem and explain the results.
Let’s do each together.
- Read and understand the problem and what is being requested.
I think I understand it – I need to determine the likelihood of selling different types of systems. The issue appears to depend on whether or not I sell the basic system. This is important because I want to start the tree with the conditioning event.
- Identify the events of interest in the scenario or problem.
- Define events in formal notation.
The problem is straightforward, so I will do 2 and 3 together.
\[A = Sell\; Advanced\; System\] \[B = Sell\; Basic\; System\] - Note the relevant probabilities that are immediately available from the data.
\[P(A) = \;?\] \[P(B) = 0.80\] \[P(A|B) = 0.30\] \[P(A|\sim B) = 0.95\] - When using trees:
- Set up the structure so that you can identify the events of interest.
- Write the immediately available probabilities into the tree structure.
- Use the tree structure to fill in the rest.
- Pick out the probabilities that solve your problem and explain the results.
The conditioning event is B, so I’ll start with that. Writing the probabilities into the tree will give me:
From this, I can derive all the other values and fill in the full tree.
Finally, I can read off the values associated with the events of interest:
\[P(Sell\; the\; basic\; system) = P(B) = 0.8.\] \[P(Sell\; the\; advanced\; system) = P(A) = P(A \cap B) + P(A \cap \sim B) = 0.24 + 0.19 = 0.43\] \[P(Sell\; both) = P(A \cap B) = 0.24\]
5.3.3 Formulas
If you progress far enough in probability, it is very likely that you will end up using formulas. They can generally handle more complex problems than trees with less work. They also provide a stronger grounding in probability for more advanced material. On the down side, they do take some practice to use.
Let’s repeat the problem above with formulas. We will follow a very similar approach to the one we used above for trees. The final step will be changed to reflect formulas rather than tree structures.
- Read and understand the problem and what is being requested.
- Identify the events of interest in the scenario or problem.
- Define events in formal notation.
- Note the relevant probabilities that are immediately available from the data.
- When using formulas:
- Write out the immediately available probabilities in formal notation.
- Write down the formulas you eventually need to answer the question.
- Manipulate formulas as required to answer the question.
- Explain the results.
We have already done steps 1-4, so we will jump straight to 5. \[P(A) = ?\] \[P(B) = 0.80\] \[P(A|B)=0.30\] \[P(A| \sim B) = 0.95\]
We want to know the probability of selling the basic system: \[P(Sell\; Basic\; System) = P(B) = 0.8\] The probability of selling the advanced system: \[P(Sell\; Advanced\; System) = P(A) = P(A \cap B) + P(A \cap \sim B)\] We can use the multiplication rule to fill these in: \[P(A \cap B) = P(A|B)P(B) = (0.3)(0.8) = 0.24\] \[P(A \cap \sim B) = P(A|\sim B)P(\sim B) = (0.95)(0.20)=0.19\] \[P(A) = P(A \cap B) + P(A \cap \sim B) = (0.24)+(0.19) = 0.43\] And finally, the probability that we sell both. We already calculated this above. \[P(Sell\; Both) = P(A \cap B) = 0.24\]
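The same chain of formula manipulations can be sketched in Python (a supplementary illustration; the variable names are ours):

```python
# Given probabilities from the consulting problem
p_b = 0.80              # P(B): sell the basic system
p_a_given_b = 0.30      # P(A|B): sell advanced given basic sold
p_a_given_not_b = 0.95  # P(A|~B): sell advanced given basic refused

# Multiplication rule for the joint probabilities
p_a_and_b = p_a_given_b * p_b                  # P(A ∩ B)
p_a_and_not_b = p_a_given_not_b * (1 - p_b)    # P(A ∩ ~B)

# Total probability rule for P(A)
p_a = p_a_and_b + p_a_and_not_b
print(round(p_a_and_b, 2), round(p_a, 2))  # 0.24 0.43
```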
Using formulas clearly required more knowledge of probability rules, but it was quite a bit quicker because we did not need to draw as much, nor did we have to calculate probabilities that we did not end up using.
5.4 Managerial Discussion
In writing this textbook, we wanted to draw on our professional experience as much as our academic training. It seems that probability is the most challenging area in which to land on well-grounded advice.
Our experience suggests that understanding the basics of probability is important for the effective use of analytics and AI tools. Probability is an area of mathematics that is deceptively difficult for most people. Our intuitions about probability regularly fail us, resulting in bad decisions. Knowing this, along with knowing about biased thinking, can help identify potential mistakes, but only the formal application of probability offers the prospect of making the right choices.
Beyond this, a formal understanding of probability is important for developing an intuitive understanding of the techniques of analytics and AI, along with their considerable limitations. It is also a significant component of the language used in help files and support documentation – much of which is written by technical experts. If your interests are technical in nature, a weak understanding of probability may become a limiting factor in your professional development.
All that said, the actual calculation of probabilities and use of the formulas is at best a part of a specialist’s skillset. The calculations are increasingly built into easy-to-use software and the probabilities used by managers are generally the results of fairly complex modelling processes.
This chapter has focused on establishing the basics. Our general advice is that anyone reading this book should aim to be fluent in the language of probability and be able to construct and solve basic probability problems, based on business contexts, of the sort that can be solved with trees and tables. Those who wish to develop their technical skills should be conversant with the application of formulas up to and including Bayes’ theorem.
Chapter 6 Random Variables
From random events and probabilities we now move on to random variables. We will see that random variables are simply functions that associate a number to each possible event in a sample space. This makes random variables a more restrictive form of modelling than random events. However, as is often the case, we will find that by restricting the scope for model development in clever ways, we end up with models that are much more useful.
We will find that restricting a model’s scope will extend into our application of random variables. Of the infinite variety of random variables, we will only study a very small set that have proved useful in many contexts. In this chapter, we will discuss the binomial and Poisson as examples of discrete random variables and the uniform and normal as continuous ones.
When we get around to applying these models to problems, it may feel like we are forcing the real world to fit the model. We have two observations on this:
The first is that no model you can construct accurately reflects the world, so in some sense, reality is always being forced into a model. The real question is not: “Which model is true?”, but “Does this model help me understand the world better / make better decisions?”. Often, a simple known model is a good way to start.
The second is that once you understand how to model using the discrete and continuous distributions we will cover in this chapter, you should be able to apply these concepts to a very large class of distributions, ranging from other standard distributions (geometric, hypergeometric, Pareto, exponential) or empirical distributions you may create from data. This is done by swapping out the probability mass or density functions and substituting in the appropriate patterns. Don’t worry if you don’t know these words yet - we will get there!
Many business problems that use random variables involve either finding the probability of a well-defined event or finding a well-defined event that has a certain probability of occurring.
In a business context, it would be useful to have a model to determine the probability that 10 customers will purchase a product if the price is set to $15.00. It should be obvious how useful such a model might be in making pricing decisions to maximize profits. An example of finding an event that satisfies a certain probability would be determining how many reservations a hotel or airline should accept to manage the risk of running out of space vs. having unused capacity. The usefulness of these two techniques in business should be obvious, particularly when built into automated decision-making tools.
These techniques are also used as the basis for calculating confidence intervals, developing hypothesis tests, determining required sample size, etc.
6.1 Discrete vs. Continuous Distributions
There are two broad categories of random variables that we will be covering in this book: discrete and continuous. While they both allow you to determine the probability of different events, it is important to understand the difference. It will eventually become obvious to you that determining whether a situation is discrete or continuous is the first thing you need to do when modelling any situation.
Both types of variables have functions that define the probability of events. For discrete distributions, the probability of a specific event is given by a probability mass function (PMF). For continuous distributions, probabilities are also defined by a function (a probability density function, or PDF), but only ever for ranges of events (more on this later). Both types of functions can be aggregated to form cumulative distribution functions (CDFs). We will generally use Excel and other software tools to solve probability problems, not the functions themselves.
Discrete distributions deal with situations where the number of possible outcomes can, in principle, be counted; continuous distributions deal with situations where the outcome must be measured. For example, you could count the number of people in a room, so the number of people in a particular room might be modeled as a discrete random variable. On the other hand, even if you knew the number of people in a room, you could not ‘count’ the weight of the people in the room; a variable like weight must be measured, so the weight might be modeled as a continuous random variable.
It turns out that these differences are built into the English language. Native English speakers tend to use number or amount; many or much; fewer or less depending on whether the quantity in question is thought of as discrete or continuous, at least when they are being precise in their speaking.
| Discrete | Continuous |
|---|---|
| How many cars did you see? | How much time did it take? |
| Were there fewer people at the party this year? | Was there less food at the party this year? |
| There were a number of questions being asked. | There was a large amount of confusion. |
To be fair, sometimes things could be thought of as either discrete or continuous. Physical money is actually discrete since any quantity of it can be described as a certain number of cents. Still one might think of it as continuous when wondering how much money a job will pay. Don’t let this obscure the issue. This is really about conceptualizing the thing in question, not the math. The math follows from the conceptual model, not the other way around. You may choose to model a discrete event as a continuous one – but that is a modelling choice, not a description of reality.
6.2 Discrete Distributions
For our purposes, discrete distributions will be characterized by probability mass functions. As mentioned above, these are mathematical functions that will associate a probability with each possible numeric outcome. The standard discrete distributions such as the binomial and Poisson distributions have closed-form mathematical PMFs. Generally, to use them you need to identify certain parameters that define the functions and then specify the event.
Optional: Empirical or Special Case Functions

6.2.1 The Binomial Distribution
The binomial distribution applies to situations that can be described as a series of a specific number of identical, independent experiments with two possible outcomes, traditionally called success and failure. Before moving on, let’s examine each of these assumptions. We will do so in the context of the simplest possible binomial problem: Suppose you are flipping a coin 20 times. Which is more probable, that you will flip exactly 10 heads or that you will get more than 14 heads?
The specific number of experiments, typically denoted with the parameter \(n\), is fixed. This means that the number of successes, \(x\), must be between \(0\) and \(n\) inclusive. In math speak, this means \(0 \leq x \leq n\). This is an important characteristic for identifying binomial distributions because it means there is a fixed upper limit to the random variable in question.
In the coin flipping case, \(n = 20\), it is the maximum number of times you can flip a head - you could not possibly get 21 heads if you only flip the coin 20 times. You could get 0 heads, so there are 21 possible outcomes. Therefore, \(0 \leq x \leq 20\). People occasionally forget that 0 is a possible outcome, so be careful here.
The experiments have to be identical; in this case, identical means having the same possible outcomes and probabilities in each trial. In the coin flipping problem, the outcomes are the same (heads and tails) and the probabilities are unchanged on each repetition of the experiment: \(P(Heads) = 0.5\) and \(P(Tails)=0.5\).
The independence of experiments requires that the results of one experiment do not impact the results of any other. Because coins do not act differently based on how they have flipped in the past, this clearly holds true in the coin tossing experiment, but it does not hold in all repeated experiments. For instance psychological, team, or learning effects might impact how people – even different people – perform in an otherwise identical experiment.
The two possible outcomes are defined as success and failure. Which outcome is considered which is arbitrary, but once it is selected, the rest of the analysis has to be completed with that definition. In the coin flipping case, we chose ‘heads’ as a success because we wanted to know something about the number of heads. Once success is defined, we need to know the probability of success, which is a parameter, \(p\). Given that there are only two possible outcomes, \(P(Success) = 1 - P(Failure)\) by the complementary rule. Some books define \(q = P(Failure) = 1 - p\), which is convenient for formulas, though we will not be using formulas so will not need it.
Under these circumstances, the probability that the random variable, called \(X\) - note the capital here - takes a specific value \(x\) is given by the binomial PMF, \(P(X = x;\; n,\; p)\).
We will typically use Excel or other statistical tools to calculate the probability, so we only need the equations for a few characteristics:
Expected Value \(E[X] = np\)
Variance \(V[X] = np(1-p)\)
Standard Deviation \(StdDev(X) = \sqrt{(np(1-p))}\)
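These summary measures are easy to compute directly; here is a quick supplementary Python sketch for the coin flipping example (\(n = 20\), \(p = 0.5\)):

```python
import math

# Binomial summary measures for n = 20 coin flips with p = 0.5
n, p = 20, 0.5
expected = n * p                # E[X] = np
variance = n * p * (1 - p)      # V[X] = np(1-p)
std_dev = math.sqrt(variance)   # StdDev(X) = sqrt(np(1-p))
print(expected, variance, round(std_dev, 3))  # 10.0 5.0 2.236
```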
In Excel, we use the BINOM.DIST() function to calculate probabilities. It has the following form:
Figure 6.1: BINOM.DIST Form
Where:
Number_s is the number of successes, or the variable \(x\).
Trials is the number of times the experiment was conducted, or the parameter \(n\).
Probability_s is the probability of success on any given trial, or the parameter \(p\).
Cumulative is the tricky one. If you enter “FALSE”, you get the results from the binomial PMF. If you enter “TRUE” you get the results from the binomial CDF.
Let’s illustrate this with our coin flipping example. There were really two questions: what is the probability of getting exactly 10 heads and what is the probability of getting more than 14 heads.
The first part of the question requires that we calculate \(P(X=10)\) where \(n = 20\) and \(p=0.5\). Since we are trying to determine the probability for a specific value of the random variable, it is most convenient to use the PMF form of BINOM.DIST(). So we should enter =BINOM.DIST(10,20,0.5,FALSE) which will produce approximately 0.176.
The second part of the question requires us to calculate \(P(X \geq 15)\).
It is useful here to introduce the idea of a number line as a tool to help organize your thinking.
Optional: Number Lines and CDFs

To calculate \(P(X \geq 15)\), we use a number line to help with our thinking:

Figure 6.5: Number Line 4
In math: \(P(X \geq 15) = 1 - P(X \leq 14)\)
And in Excel: =1 - BINOM.DIST(14,20,0.5,TRUE) \(\approx 0.021\)
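For readers who prefer to check Excel's answers with code, here is a minimal Python sketch (Python is not part of this book's toolkit, so treat it as an optional cross-check) that reproduces both binomial calculations from first principles:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Binomial(n, p); mirrors BINOM.DIST(x, n, p, FALSE)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    # P(X <= x); mirrors BINOM.DIST(x, n, p, TRUE)
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

p_exactly_10 = binom_pmf(10, 20, 0.5)        # about 0.176
p_more_than_14 = 1 - binom_cdf(14, 20, 0.5)  # about 0.021
```

Note that the second line uses the same complement trick as the Excel solution: \(P(X \geq 15) = 1 - P(X \leq 14)\).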
So the answer is: it is quite a bit more likely to get exactly 10 heads than it is to get more than 14 heads.
6.2.1.1 Binomial Distribution Problem
I spend an awful lot of time in corporate meetings. Our executives tend to be very busy, and at any given time there is only a 90% chance an executive shows up for a meeting on time. On Monday, my entire team is in a meeting, including 3 executives. What is the probability that at least one shows up on time? What about all 3?
Hint 1: What is the event? Hint 2: How do I calculate it in Excel?
6.2.2 The Poisson Distribution
The Poisson distribution deals with situations where things happen randomly at a given rate over time or space. The rules for the Poisson are that events occur independently over time at a constant random rate and only one event can occur at a time. As with the binomial, we will illustrate this with a typical problem, in this case, the number of times lightning strikes during a thunderstorm. Suppose that during a typical thunderstorm, lightning strikes once every 5 minutes. How likely is it that I will see 4 or more lightning strikes if I watch for 15 minutes?
The independence of events implies that the fact that an event occurs does not influence the possible occurrence of any other event. So if lightning strikes 3 seconds into the storm, it does not impact the likelihood of lightning striking at any other time.
The constant random rate may seem like a contradiction in terms. What we mean here is that over any continuous period of time, there is an expected (i.e. average) number of times that the event will occur. We note this rate as the parameter lambda: \(\lambda\). The expected rate is for a particular event, in this case, lightning striking, over a specific period of time. In general we will write out lambda with all three components as:
\[\lambda = <Rate> of <Event> per <Unit\; Time\; or\; Space>\]
In this case, we could write out:
\[\lambda = <1><lightning\; strike> per <5\; minutes\; of\; thunderstorm\; watching>\]
This will help us keep track of the rates and events we are talking about. In the sample problem, we are not interested in what happens in a five minute period, but what happens in 15 minutes, so we need to rescale lambda to:
\[\lambda = <1*3><lightning\; strikes> per <5*3\; minutes\; of\; thunderstorm\; watching>\] \[\lambda = <3><lightning\; strikes> per <15\; minutes\; of\; thunderstorm\; watching>\]
Again, you may think that this three part description is too cumbersome, but it becomes a useful technique when the changes in units become more complicated.
Optional: Why use \(\lambda = <Rate>of<Event>per<Unit\;Time\;or\;Space>\)
The restriction that only one event can occur at a time can have some significant impacts on modelling. In the case of the lightning strikes, it is probably a reasonable approximation: since lightning strikes are instantaneous, even if two appeared to occur at the same time, that could just be a measurement issue.
The restriction becomes more significant if we considered guests coming to a restaurant. If we modelled the arrival of individuals to a restaurant, we would violate the one occurrence at a time restriction because people often go to restaurants in groups. To get around this one could say that the first member of a group arrives an instance before the second, but then the existence of groups would violate the independence of events rule.
On the other hand, if we model the arrival of parties of individuals, it is more reasonable to think that no two can arrive at the same time and that the occurrences are independent. Whether or not this is true (or true enough) is an empirical question.
Under these circumstances, the probability that the random variable, called \(X\) (note the capital here), takes a specific value \(x\) is given by the Poisson PMF, \(P(X = x; \lambda)\).
We will typically use Excel or other statistical tools to calculate the probability, so we only need the equations for a few characteristics:
Expected Value \(E[X] = \lambda\)
Variance \(V[X] = \lambda\)
Standard Deviation \(StdDev(X) = \lambda^{\frac{1}{2}}\)
Excel provides the POISSON.DIST() function to calculate probabilities. It has the following form:
Figure 6.6: POISSON.DIST Form
Where: \(X\) is the variable \(x\).
Mean is \(\lambda\).
Cumulative, as with the binomial, Cumulative = TRUE provides the CDF or \(P(X\leq x)\) and Cumulative = FALSE produces the PMF or \(P(X=x)\).
Let’s wrap up this discussion by completing our problem. We were asked to determine the probability of seeing four or more lightning strikes in 15 min of watching a thunderstorm. We already calculated that \(\lambda = <3><lightning\; strikes>\; per <15\; minutes\; of\; thunderstorm\; watching>\).
Figure 6.7: Number Line for Lightning Strikes
So, \(P(Four\; or\; More\; Lightning\; Strikes)\) = \(1-P(X \leq 3) =\) 1-POISSON.DIST(3,3,TRUE) \(\approx 0.353\)
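As a cross-check on the Excel answer, here is a short Python sketch (an optional illustration, not part of the book's companion files) computing the same Poisson probabilities from the PMF:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam); mirrors POISSON.DIST(x, lam, FALSE)
    return lam**x * exp(-lam) / factorial(x)

def poisson_cdf(x, lam):
    # P(X <= x); mirrors POISSON.DIST(x, lam, TRUE)
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

p_four_or_more = 1 - poisson_cdf(3, 3)  # about 0.353
```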
6.2.3 Solving Discrete Random Variable Problems
When faced with a modelling problem involving discrete variables, the first thing you need to do is determine what random variable you should use. Most of the time this is as much of an art as a science. Rarely does the world fit any model perfectly. More often than not, one has to make simplifying assumptions to force a model into one of the standard choices. In some cases the real world scenario might fit two or more different models.
When \(n\) is large and \(p\) is small, the Poisson distribution can be used to simulate the binomial and vice versa. To do so, one needs to be able to convert the parameters between them. We do this by setting the means equal and then solving to determine the parameters. In binomial \(Expected\; Value = np\), in Poisson, \(Expected\; Value = \lambda\), so given an \(n, p = \frac{\lambda}{n}\).
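To see this approximation in action, here is a small Python sketch (our own illustration; the specific \(n\) and \(p\) were chosen only for demonstration) comparing Binomial(1000, 0.003) with the Poisson that matches its mean, \(\lambda = np = 3\):

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Binomial(n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam)
    return lam**x * exp(-lam) / factorial(x)

# With n large and p small, the two PMFs are nearly identical.
n, p = 1000, 0.003
lam = n * p  # match the means: lambda = np, i.e. p = lambda / n
max_gap = max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam))
              for x in range(11))
print(max_gap)  # the largest PMF difference over x = 0..10 is well under 0.001
```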
Optional Practice Problem: Fibre
6.2.3.1 Choosing a Distribution
Choosing which distribution to use starts with deciding if the random variable is discrete or continuous. If it is a finite choice or collection of specific things you can count, it is probably discrete; if it is something that you must measure it is probably continuous. There may be exceptions, but this is a good place to start.
The next thing to do is determine which specific distribution it is. So far we have discrete distributions, and in that only the binomial and Poisson. While it is sometimes possible to model a single real-world problem as either a binomial or Poisson, it remains important to be able to differentiate models in general.
One key difference between the Poisson and the binomial is that with the Poisson there is no fixed upper limit to how many times an event can occur.
The process of selection is shown in the following diagram:Figure 6.8: How to pick a distribution
Having decided which model might apply, you then have to confirm that the assumptions are either strictly true or close enough to justify developing the model. This is only the first step in the modelling process. The whole process goes like this:
6.2.3.2 Generic Approach to Solving Distribution Problems
- Determine type of problem and associated distribution.
- Code the problem (what do you know and what is relevant).
- Identify what you are trying to find (e.g., \(P(X>3)\)).
- Draw a diagram (e.g., normal distribution or number line as appropriate).
- Translate what you are trying to find into a form you can compute with a formula or look up in Excel.
- For Binomial, you need to translate \(P(X>4)\) into \(1-P(X\leq 4)\).
- For Normal, you need to translate into Z values and do an Excel lookup.(More on this in a minute!)
- Solve to address your question (from part 3), check it for reasonableness, then report and explain your answer.
6.3 Continuous Distributions
From discrete distributions we move on to continuous ones. The key characteristic here is that individual values have zero probability, only ranges of values have probability. We will begin by looking at the uniform distribution because of its ease of use. From there we will move to the normal distribution, which is the most important distribution in statistics.
6.3.1 Uniform Distributions
The uniform distribution is characterized by two parameters \(a\) and \(b\) which define the minimum and maximum values of random variables that can be produced. We typically denote the uniform distribution with the notation \(U(a,b)\) where \(a\) and \(b\) define the range. Within this range, equal-sized events have equal probability of occurring. The distribution’s PDF is:
\[ f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}\]
Which looks like:Figure 6.9: Uniform Distribution
As with the other distributions we have seen, the uniform has many other statistical characteristics, but the only two we will tend to use are its \(Expected\; Value = \frac{a+b}{2}\) and \(Variance = \frac{(b-a)^2}{12}\).
The uniform distribution does not occur frequently in real world problems. It does come up occasionally in situations where only the range of possible outcome is known. For example, if you knew that a fire alarm test was going to happen between 9:00 and 10:00 AM but not when, you might think of a uniform distribution. You can think of this approach as applying in situations where you are maximally ignorant.
The beauty of the uniform distribution is that it allows us to calculate probability as area without using any software or complex formulas. You may recall that the area of a rectangle is given by the base times the height. The height of the uniform distribution is always \(\frac{1}{(b-a)}\) so calculating the probability of a continuous event within \(a\) and \(b\) is simply a matter of determining the length of the base and multiplying it by \(\frac{1}{(b-a)}\). An example will make this clear.
Suppose you know that a fire alarm test will occur between 9:00 and 10:00 AM, but not when. You have a 10 minute meeting scheduled for 9:15 – 9:25, what is the probability that the test will occur during your meeting?
To solve this problem, we need to determine \(a\) and \(b\). Since the problem is scaled in minutes, we may as well use minutes for scaling \(a\) and \(b\). So we will define \(a = 0\), \(b = 60\) then the height is \(\frac{1}{60}\). With this, the event becomes ‘between minutes 15 and 25 inclusive’ which becomes the event \(15 \leq X \leq 25\).
So, we are looking for \(P(15 \leq X \leq 25)\) on a Uniform distribution \(U(0,60)\):
\[P(15 \leq X \leq 25) = \frac{25-15}{60} = \frac{10}{60} = \frac{1}{6}\]
It is that easy!
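The base-times-height rule is easy to wrap in a tiny function. Here is a minimal Python sketch (our own illustration) of the uniform probability calculation, including the clipping to the range \([a, b]\):

```python
def uniform_prob(lo, hi, a, b):
    # P(lo <= X <= hi) for X ~ U(a, b): the base of the rectangle
    # times its height 1/(b - a), clipped to the support [a, b]
    lo, hi = max(lo, a), min(hi, b)
    return max(0.0, hi - lo) / (b - a)

p_meeting = uniform_prob(15, 25, 0, 60)  # 10/60 = 1/6
```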
The key takeaway here is to begin thinking of probability as the area under a curve. If you develop it, this intuition will serve you well with other, more complex distributions, such as the normal which we turn to now.
6.3.2 The Normal Distribution
The normal distribution is a class of distributions for continuous random variables. It has two parameters: the mean, represented by mu (\(\mu\)), and the standard deviation, represented by sigma (\(\sigma\)). It is typically denoted \(N(\mu,\sigma)\). It has a symmetric bell-shaped distribution centred at \(\mu\), with tails that extend from \(-\infty\) to \(+\infty\). Probabilities are measured as areas under the curve – though we will use Excel to calculate them.
Figure 6.10: Normal Distribution
The normal distribution is very important in statistics because approximately normal distributions occur in many real-world problems. As with the binomial, Poisson and uniform, the normal distribution is actually a whole family of distributions, each of which is defined by a particular combination of its parameters.
In the case of the normal, one particular version is particularly important: the standard normal. It has \(\mu =0\) and \(\sigma = 1\), and is often referred to by the random variable \(Z\), so \(Z\sim N(0,1)\). Any normal random variable \(X\) can be converted to the standard normal by using the Z-score: \[z = \frac{x-\mu}{\sigma}\] And back by solving the z-score for x to get: \[x = \mu + z\sigma\]
The z on the standard normal is equivalent to the point \(x\) in the sense that \(P(Z<z) = P(X<x)\) for any point. In other words, all the probabilities associated with the events are the same.
As with the binomial and Poisson, you need to use a computer to calculate these probabilities. Excel has several functions that will do this calculation for you. In particular, there is one for an arbitrary normal, and one for the standard normal. They are NORM.DIST and NORM.S.DIST, respectively.
Figure 6.11: Normal Distribution in Excel
Figure 6.12: Standard Normal Distribution in Excel
The parameter Cumulative is used to indicate that you want the CDF, the cumulative distribution function. This will almost always be TRUE for your purposes.
Before we go on, let’s illustrate how we solve a problem with a normal distribution.
My oven is not working properly. If I set the temperature to 350 degrees, the actual temperature will be normally distributed with a mean of 350 degrees and a standard deviation of 10 degrees. What is the probability that my oven will actually be above 365 degrees if I set it at 350? What is the probability that the temperature will be within 5 degrees of the target temperature?
We should follow the generic approach to solving distribution problems described above.
Normally, we would start with determining the distribution, identifying what is relevant and determining what we want to find – typically an event or a probability of an event. Here these are spelled out in the problem. The distribution is \(X \sim N(350,10)\), which contains the parameter values we need. We are trying to find two probabilities: \(P(X>365)\) and \(P(345 < X < 355)\).
The diagrams (Step 4) are shown below with the probabilities highlighted in red.
Figure 6.13: Calculating Probabilities using Normal Distribution Diagrams
Step 5 is to translate this into a form that you can compute with a formula or look up in Excel. In this case, we will use Excel. For \(P(X>365)\) we need to calculate \(1-P(X \leq 365)\) because all CDFs start aggregating probability from the left and move to the right. In Excel, we can do this calculation using the general normal distribution: =1-NORM.DIST(365,350,10,TRUE) \(\approx 0.0668\). Or we can use the standard normal through the z-score,
where \(z = \frac{x - \mu}{\sigma} = \frac{365-350}{10} = 1.5\). Having found the standard normal equivalent to \(x\), we can lookup \(P(Z>1.5)\) =1-NORM.S.DIST(1.5,TRUE) which produces the exact same answer as above.
Figure 6.14: Calculating Probabilities using Normal Distribution Diagrams
For \(P(345 < X < 355)\), we need to calculate =NORM.DIST(355,350,10,TRUE) - NORM.DIST(345,350,10,TRUE) \(\approx 0.3829\). You should also be able to do this using the standard normal. See if you can before looking at the solution below.
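Both oven probabilities can also be reproduced outside Excel; Python's standard library includes statistics.NormalDist, which plays the role of NORM.DIST and NORM.S.DIST (this sketch is an optional cross-check, not part of the book's companion files):

```python
from statistics import NormalDist

oven = NormalDist(mu=350, sigma=10)   # X ~ N(350, 10)
z = NormalDist()                      # standard normal, N(0, 1)

p_above_365 = 1 - oven.cdf(365)             # about 0.0668
p_within_5 = oven.cdf(355) - oven.cdf(345)  # about 0.3829

# The z-score route gives the same first answer: z = (365 - 350) / 10 = 1.5
p_via_z = 1 - z.cdf(1.5)
```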
In the introduction to this chapter, we said:
“Many business problems using random variables involve either finding the probability of a well-defined event or finding a well-defined event that has a certain probability of occurring.”
To solve the second type of problem, finding the event that has a given probability, Excel provides the inverse functions NORM.INV and NORM.S.INV.
Figure 6.15: Normal Distribution Inverse in Excel
Figure 6.16: Standard Normal Distribution Inverse in Excel
You are probably getting the idea by now, but let’s do one last example. Suppose you would like to know the temperature so that there is only a 5% chance that the oven will exceed that temperature.
Graphically we would be looking for \(x\) such that \(P(X>x) = 0.05\), or equivalently \(P(X<x) = 0.95\)
Figure 6.17: Standard Normal Distribution Inverse in Practice
Once again, the Excel functions give us events starting on the left and moving to the right, so we actually need to work with the complementary event and find the \(x\) with \(P(X<x) = 0.95\). In Excel this is found using the function =NORM.INV(0.95,350,10) \(\approx 366.449\). So, \(x\) is about 366.45 degrees.
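The inverse lookup also has a code equivalent; NormalDist.inv_cdf plays the role of NORM.INV (again an optional Python cross-check, not part of the book's files):

```python
from statistics import NormalDist

oven = NormalDist(mu=350, sigma=10)
x_cutoff = oven.inv_cdf(0.95)  # mirrors NORM.INV(0.95, 350, 10); about 366.45

# Via the standard normal: x = mu + z * sigma
z_95 = NormalDist().inv_cdf(0.95)  # about 1.645
x_via_z = 350 + z_95 * 10
```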
Try repeating this calculation using the standard normal.
Hint 1: How do I get started? Hint 2: What is the formula? Hint 3: I have the Excel output - now what?
Optional: Technical Detail on Parameters as Variables
You may be thinking that using the z-score is an unnecessary complication. For these problems, you are right. In the old days, this was a necessity because people had to use tables to find the values for normal distribution problems. Naturally this is no longer the case, though many textbooks still have them – but not ours!
Thinking in terms of the z-score is still helpful, and we encourage you to be familiar with it because it is useful for a number of calculations we will do shortly, and it is essential when calculating the t-distribution. But that can wait for the next chapter.
6.4 Managerial Discussion
There are three critiques that one frequently hears regarding mathematical models: 1. The models are not realistic. 2. When am I going to use this? 3. Do these work in practice?
While we do not feel the need to defend the field of study, here are some thoughts on the matter.
6.4.1 Are they realistic?
The first thought might be: realistic as opposed to what?
If an organization is not using any modelling, then these are probably a better starting point than most guesses that one could come up with in the absence of a model. At least they can form the basis for continuous improvement by standardizing the approach and making clear predictions. All these techniques have ever promised to do is provide a bit of insight, they were not meant to solve the problems for you.
As for the strong assumptions that are made, these can often be relaxed with minimal effort. For example, you might want to determine staffing requirements for a restaurant using a Poisson distribution of customer arrivals. This would say that you need a constant lambda for the planning period, when you know full well that customers arrive at a greater rate between 11:45 and 1:30 than between 1:30 and 5:00. If that is the case, don’t be frustrated that the model requires a single lambda; use two models rather than one.
For reasons we will discuss in the next chapter, the normal distribution is exceptionally useful as an approximation for many problems.
6.4.2 Will I use them?
Since you are reading this text, you are probably taking a course in analytics or artificial intelligence. As a matter of the course content, you will want to learn how to identify and use these specific models. That said, as authors, we see this less as a process of learning about these specific models and more as a first step into learning to model.
The steps we have taken in applying these models apply to most model thinking. We identified a standard way of thinking about a problem that could be leveraged to provide some useful insight. We then used the data to configure the model – granted it was only finding \(\lambda\), \(n\) and/or \(p\) – but still, those were empirical issues. Lastly, we thought about how to ‘talk to the model’ by providing an event whose probability we wanted or a probability whose event we wanted.
These are basically the steps in many modelling situations, so building the practice of modelling thinking is valuable even if you do not use any particular model. In particular, we have used techniques that are not too much more advanced, along with empirical distributions, to create simulations for professional planning models that have saved hundreds of thousands or even millions of dollars (sadly, not on a commission basis.)
6.4.3 Are they used in practice?
As to the specific content of these models, they form inputs into real-world modelling techniques. Poisson distributions are used in the study of waiting lines and capacity (queuing theory), and as inputs to more complex models that predict count data, such as the number of customers in planning models.
The logic of the binomial distribution arises as the basis for other pieces of analysis in analytics and AI contexts.
The normal distribution, because it arises so frequently in real-world phenomena, is used extensively in forecasting, inventory planning, quality control, and a wide variety of other problems.
Chapter 7 CLT and Applications of the Normal Distribution
This chapter is all about the usefulness of the normal and t-distributions. We begin by describing the Central Limit Theorem, which establishes that normal distributions arise from processes that average or sum other random variables.
Given that the normal distribution occurs all the time, we will build on the material covered in the previous chapter to make probability statements that extend the two problems we described there: finding a well-defined event that has a certain probability, and finding the probability of a well-defined event.
We will then ‘solve’ the z-score equation for probability in two different ways to provide: 1. A confidence interval for a forecast 2. The sample size required to draw a particular inference
We will then introduce the t-distribution. As you will see, the t-distribution occurs when you would otherwise have a normal distribution but you do not know the standard deviation and therefore have to estimate it from the data. You will find that, in most cases, using the t-distribution rather than the standard normal does not change the way you think about things, makes a very small difference to the numeric results, and is typically handled by the software behind the scenes.
7.1 The CLT
The Central Limit Theorem establishes that, when random variables are averaged or summed, the sampling distribution of the sample mean becomes closer and closer to the normal distribution as the sample size gets larger. Just to be clear, the actual distribution of the sample does not become more normal; rather, the ‘sampling distribution of the sample mean’ becomes more normal.
7.1.1 What the CLT does not say
My experience is that many people make serious mistakes regarding what the CLT actually says about data. Let me illustrate what it DOES NOT SAY with an example.
If you took a sample of \(n > 30\) employees at a particular firm, the sick time taken by those employees would probably not be normal. As discussed earlier, you could get a sense of the distribution by creating a histogram, though you would certainly want more than \(n = 30\) to get a real sense of the data.
My guess is that if you did build this histogram you would get a lognormal or exponential distribution of sick time by employees. I’d expect this because I suspect that many employees take a relatively small number of sick days due to seasonal illness; a few will require longer periods of sick time due to serious illness or other medical problems.
This prediction was borne out by one website that provided some data on the topic here with \((n=92)\):
Figure 7.1: Histogram of Sick Days
Strictly speaking, this is a population, not a sample, but if we think of the company as a DGP, it is perfectly sensible to think of this population as being but one of many populations that could have been generated by the process. With that understanding, it is sensible to think of the population as a sample from which we could attempt to infer characteristics of the DGP.
At a glance this data looks lognormal to me, though I did not formally test it. The key point here is that the distribution is not normal, the CLT DOES NOT say that it should be. The CLT DOES NOT say anything about the shape of the original data.
7.1.2 What the CLT does say
Now, suppose you took a sample of \(n=36\) from this population. The CLT says that the mean of that sample, \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_i\), or the sum of that sample, \(Sum_x = \sum_{i=1}^{n}x_i\), will be asymptotically normal with the following distributions: \[\bar{x} \sim N^a (\mu_\bar{x} = \mu_x, \sigma_\bar{x}=\frac{\sigma_x}{\sqrt{n}})\] and \[Sum_x \sim N^a (\mu_{sum_x}=n\mu_x, \sigma_{sum_x}=\sqrt{n}\sigma_x)\] This fact is demonstrated in the Excel file, which shows a histogram of the means of 10,000 samples, shown below. You can see that this distribution looks pretty close to a normal distribution.
Figure 7.2: Histogram from 10,000 samples
Some of the apparent deviation from normal is a result of the histogram only having 10,000 observations. If we did this with more observations, the histogram would tend to look smoother.
As a final comment, since we had access to the population, not a sample, we can calculate its mean and standard deviation as \(\mu = 51.864\) and \(\sigma = 47.982\) respectively, so we know the normal from which these distributions are drawn has a mean of 51.864 and a standard deviation of \(\frac{\sigma}{\sqrt{n}}=\frac{47.982}{\sqrt{36}}=7.997\). Or, in notation:\(\bar{x} \sim N^a(51.864,7.997)\).
So as a consultant to an HR department for a large firm, you might want to know how many sick days employees take on average. This average would not tell you anything about an individual’s likely sick time, but it could help with staff planning in general. The CLT helps with average results, not with individual results. To understand something about the individual results, you would need to look at the histogram of actual sick days.
If you would like to see how this works in practice, download the Excel file here.
Download SickTimeData_CLT.xlsx
7.1.3 Summary of CLT for means
In summary, for most situations where we have samples of \(n > 30\), the CLT ensures that the distribution of \(\bar{x}\) is approximately normal and follows the distribution \(\bar{x} \sim N^a(\mu_\bar{x} = \mu_x, \sigma_\bar{x} = \frac{\sigma_x}{\sqrt{n}})\). This tells us that there is a corresponding z-score formula for means, which is: \[Z = \frac{\bar{x} - \mu_\bar{x}}{\sigma_\bar{x}} =\frac{\bar{x} - \mu_x}{\frac{\sigma_x}{\sqrt{n}}}\] This turns out to be a very useful formula!
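You can watch the CLT at work with a small simulation. The sketch below (our own illustration; the exponential population and the sample size \(n = 36\) are arbitrary choices, not the book's sick-day data) draws 10,000 sample means from a clearly non-normal population and checks them against the CLT's predictions:

```python
import random
from statistics import mean, stdev

random.seed(42)  # make the simulation repeatable
n, reps = 36, 10_000

# Population: Exponential(rate = 1), which is strongly right-skewed,
# with mu_x = 1 and sigma_x = 1.
sample_means = [mean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

# CLT predictions: E[x-bar] = mu_x = 1 and
# StdDev(x-bar) = sigma_x / sqrt(n) = 1/6, despite the skewed population.
print(mean(sample_means), stdev(sample_means))
```

Plotting a histogram of sample_means would show the familiar bell shape, just as in the Excel companion file.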
7.1.4 CLT and Proportions
In a previous chapter, we introduced the idea of the dummy variable and that we could define being male as a dummy variable \(Male = 1\) for all the males in a group and \(Male = 0\) for the rest. We also noted that the proportion \(\frac{1}{n}\sum^n_{i=1} male_i\) equals the probability that a randomly selected individual is male. If you look at the proportion or probability formula, you can see it is just a special case of the averaging formula used in motivating the CLT - so, a CLT should apply to these values as well (and it does).
Since this is a special case of the CLT, a special set of rules apply. Specifically, if we define \(d\) to be a dummy variable, \(d=1\) if the characteristic holds, and \(d=0\) otherwise, then the proportion \(\hat{p} = \frac{1}{n}\sum_{i=1}^n d_i\) is an estimate of the population proportion \(p\) in the same way that \(\bar{x}\) is an estimate of the population mean \(\mu\).
If the sample size \(n\) is large enough that \(np >5\) and \(n(1-p) > 5\), then we can invoke the CLT and say that: \[\hat{p} \sim N^a \left(\mu_{\hat{p}} = p,\; \sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}\right)\] Which means that under many circumstances we can treat estimates of proportions or probabilities as having asymptotically normal distributions. This will be incredibly useful in making predictions, testing hypotheses, and a variety of other techniques.
These formulas rely on the population proportion, \(p\), not the sample proportion \(\hat{p}\), so we either make assumptions or use \(\hat{p}\) in most cases.
7.1.5 Summary of CLT for Proportions
In summary, when calculating sample proportions, we are really just calculating the mean of a dummy variable. As a result, the CLT applies if \(n\hat{p} >5\) and \(n(1-\hat{p}) >5\). When the CLT applies, we know that \(\hat{p} \sim N^a \left(\mu_{\hat{p}} = p,\; \sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}\right)\). This tells us that there is a corresponding z-score formula for sample proportions! It is: \[Z = \frac{\hat{p}-\mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}\] This also turns out to be a very useful formula!
In summary, given that the CLT justifies the use of the normal distribution (and later the t-distribution) we now turn to how we can use it.
7.2 Forecasting and Predicting
Generally, there are two types of forecasts. The first is the point forecast, which proposes a specific number for some unknown value. While they can be incredibly useful for planning purposes, point forecasts, even when done properly using all available information in the best possible way, are virtually always wrong.
With continuous random variables, this occurs for the technical reason that there are an infinite number of continuous values that the true value could be. Even with discrete random variables, such as your earnings last year in dollars and cents, the number of possible values is so large that it is unlikely that the true value will be selected.
The other type of forecast is a range forecast. Range forecasts, as the name suggests, propose a range of values the true value is likely to lie within. The problem with range forecasts is selecting the range: make it too narrow and it is unlikely to be true; make it too wide and it is unlikely to be useful. So analytics and AI applications that generate forecasts have invented a way to make this tradeoff. This tradeoff is formally captured by forecast accuracy.
Forecast accuracy is normally described in percentage terms, using a statement like ‘this forecast is 95% accurate’, which means that 95% of the time the forecasting process generates a range that contains the true value. Generally we will use the letter \(\alpha\) (alpha) and describe forecast accuracy as \(1-\alpha\). By this convention, \(\alpha\) is the probability that the forecasting process generates a false range, which is to say, one that does not contain the true value.
As we saw in the previous chapter, we can use the normal distribution to find a well-defined event that has a specified probability of happening. Suppose we want to find the range of values, around some unknown population parameter \(\mu\), such that 95% of the time, a random value drawn from the normal distribution will fall in that range. To do so, we could find a value on the standard normal that is so small that only \(\frac{\alpha}{2} = 0.025\) of the time we would find a smaller value and another value so that only \(\frac{\alpha}{2} = 0.025\) of the time the value would be larger.
The smaller value is easy to find using the NORM.S.INV function in Excel to obtain: =NORM.S.INV(0.025) \(= -1.9596 \approx -1.96\)
Since the normal distribution is symmetric around the mean, the upper value is 1.96. So, \(P(-1.96 < Z < 1.96) = 0.95\). We could in general repeat this process with any other value of \(\alpha\) to find an upper and lower bound, \(Z_{LB}\) and \(Z_{UB}\), that would satisfy the equation: \[P(Z_{LB} < Z < Z_{UB}) = 1 - \alpha\] Given the criteria for the CLT are met, we can take the mean of any sample and say: \[\bar{x} \sim N^a (\mu_\bar{x} = \mu_x, \sigma_\bar{x} = \frac{\sigma_x}{\sqrt{n}})\] Better yet, given the Z-score from the previous chapter, we can calculate that: \[P(Z_{LB} < \frac{\bar{x}-\mu}{\sigma_\bar{x}} < Z_{UB}) = 1 - \alpha\] Finally, it takes a few steps and involves some substitutions, but this can be solved to produce: \[P(\bar{x} - Z_{UB} \sigma_\bar{x} < \mu < \bar{x} + Z_{UB} \sigma_\bar{x}) = 1 - \alpha\] This can be written more neatly as: \[\bar{x} \pm z_{\alpha / 2} \sigma_\bar{x} \; or \; \bar{x} \pm z_{\alpha / 2} \frac{\sigma_x}{\sqrt{n}}\] and is the formula for the \((1-\alpha)\) confidence interval for means. If we don’t know the \(z_{\alpha / 2}\) value, we can find it using Excel’s NORM.S.INV function for any probability \(\alpha\) by substituting in \(1-\frac{\alpha}{2}\).
Following a similar logic, the confidence interval formula for proportions is given by: \[\hat{p} \pm z_{\alpha / 2} \sqrt{\frac{p(1-p)}{n}}\] This formula presents an obvious challenge - if we knew \(p\) (the population parameter) we would not need to estimate it. This is resolved by using \(\hat{p}\) instead of \(p\) in the formula.
Let’s try applying them. You will quickly see they are much easier to use than to derive.
Sample Question A sample of 36 customers is randomly selected and asked their age and gender. The average age is 28.32 years and 15 of them are male. Assuming the standard deviation of age is known to be 2.3 years:
- Calculate the 95% confidence interval for the population mean age.
- Calculate the 99% confidence interval for the proportion of males.
Answer
1. Since \(n>30\), the CLT applies. Then, the confidence interval for the population mean is \(\bar{x} \pm z_{\alpha / 2} \sigma_\bar{x}\). From the question, we know the sample size (\(n=36\)), the mean (\(\bar{x}=28.32\)), and the standard deviation (\(\sigma_x = 2.3,\; so\; \sigma_\bar{x} = \frac{\sigma_x}{\sqrt{n}} = \frac{2.3}{\sqrt{36}} \approx 0.3833\)). Lastly, we need \(z_{\alpha / 2} \approx 1.96\) for \(\alpha = 0.05\), which we looked up a few minutes ago. If we had not, we could use the NORM.S.INV function in Excel and plug in \(1-\alpha / 2\) to find the upper bound.
Plugging these values into the formula we get:
\[28.32 \pm 1.96 \frac{2.3}{6} = 28.32 \pm (1.96)(0.3833) = 28.32 \pm 0.7513\]
2. For the proportion of males, we first note that \(np > 5\) and \(n(1-p) > 5\), so the CLT applies and the confidence interval for proportions is given by: \[\hat{p} \pm z_{\alpha / 2} \sqrt{\frac{p(1-p)}{n}}\] Since there are 15 males in a group of 36 people, the proportion of males in the group is \(\hat{p} = \frac{15}{36}\), with \(n=36\). The only other information we need is \(z_{\alpha / 2}\). In this case, we want the 99% confidence interval. This means \(1-\alpha = 0.99 \;and\; \alpha = 0.01\). We can use the Excel formula NORM.S.INV with the value \(1-\frac{\alpha}{2} = 1 - \frac{0.01}{2} = 0.995\) to get the value \(\approx 2.576\). The rest is just plugging into the formula: \[\hat{p} \pm z_{\alpha / 2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{15}{36} \pm 2.576 \sqrt{\frac{\frac{15}{36}(1-\frac{15}{36})}{36}} \approx 0.417 \pm 0.212\]
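This book's worked examples use Excel, but both intervals are easy to double-check by script. Here is a quick sketch in Python using only the standard library (in R the analogous quantile call is qnorm; the variable names below are ours, not the text's):

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, sigma = 36, 28.32, 2.3

# 95% CI for the mean: x_bar +/- z * sigma / sqrt(n)
z95 = NormalDist().inv_cdf(1 - 0.05 / 2)           # plays the role of NORM.S.INV, ~1.96
margin_mean = z95 * sigma / sqrt(n)                 # ~0.7513
print(f"mean: {x_bar - margin_mean:.4f} to {x_bar + margin_mean:.4f}")

# 99% CI for the proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
p_hat = 15 / 36
z99 = NormalDist().inv_cdf(1 - 0.01 / 2)            # ~2.576
margin_prop = z99 * sqrt(p_hat * (1 - p_hat) / n)   # ~0.2117
print(f"proportion: {p_hat - margin_prop:.3f} to {p_hat + margin_prop:.3f}")
```

The two margins match the hand calculations above, \(\pm 0.751\) for the mean and \(\pm 0.212\) for the proportion.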
Now, you try a problem. Continuing the story from above, what is the 90% confidence interval for age?
Hint 1: What story are you talking about? Hint 2: What is the formula for confidence intervals? Hint 3: How do I calculate \(z_\alpha\)? Hint 4: What is the answer?
It turns out that this formula, and ones closely related to it, are used for confidence intervals for all sorts of forecasts when the forecast is unbiased with errors that tend toward normal under a CLT.
7.3 Determining Sample Size
The confidence interval formula we derived above came from the application of the z-score formula to a probability statement. It turns out that we can solve that relationship in one other way to provide a very useful bit of information.
Suppose you were trying to determine whether offering a discount of $100 would turn into a sufficient increase in sales to justify the cost of the promotion. You could collect a sample of \(n\) prospective customers, offer them the discount and then estimate the proportion that purchased. The result would be an estimate of the population parameter – the true value itself always remains unknown.
The question is, how many people should you sample? If you sample too few, you will not be able to trust your answer; too many and you may be wasting money on a promotion that doesn’t work. This is one example of a general problem that involves determining required sample size. As we will discuss in hypothesis testing, this relates to issues of power.
To solve the problem, we need to answer two managerial questions: How accurate do you need the estimate to be and how much confidence do you need in that estimate?
The issue of accuracy is addressed by defining an error tolerance between the estimated value and the true but unknown population parameter. We can define this error tolerance as \(E = \bar{x} - \mu\). The confidence is defined in the same way as above: we can be 95% confident by tolerating a 5% chance of error. In general, we are \(1-\alpha\) percent confident by tolerating an \(\alpha\) chance of a mistake; \(\frac{\alpha}{2}\) of the time we will be too high and \(\frac{\alpha}{2}\) of the time too low.
If we return to our Z-score formula for continuous variables, picking the particular \(z = z_{\alpha / 2}\) because it satisfies the probability statement that \(P(-z_{\alpha /2} < Z < z_{\alpha / 2}) = (1-\alpha)\): \[z_{\alpha / 2} = \frac{\bar{x}-\mu_x}{\frac{\sigma_x}{\sqrt{n}}}\] If we substitute in our error tolerance \(E = \bar{x} - \mu\) we get: \[z_{\alpha / 2} = \frac{E}{\frac{\sigma_x}{\sqrt{n}}}\] which we can solve for \(n\) to obtain: \[n = (\frac{z_{\alpha / 2}\sigma_x}{E})^2\] This is the sample size required to achieve an error bound of \(E\) with confidence \(1-\alpha\). A similar process can be applied to the z-score for proportions to provide the sample size required to achieve an error bound on proportion estimates. The resulting formula is: \[n = (\frac{z_{\alpha / 2}\sqrt{p(1-p)}}{E})^2\]
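The means formula can be sketched in a few lines of code. The numbers below are invented purely for illustration — say a pilot suggests \(\sigma \approx 10\) and we want to be within \(E = 2\) units with 95% confidence (Python, standard library only):

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, E, alpha=0.05):
    """n = (z_{alpha/2} * sigma / E)^2, rounded up to the next whole observation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / E) ** 2)

n_required = sample_size_mean(sigma=10, E=2)  # (1.96 * 10 / 2)^2 ~ 96.04, rounded up to 97
print(n_required)
```

Remember to check the CLT condition on the result (here \(97 \geq 30\), so we are fine).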
7.3.1 Using Sample Size Formulas
Before we provide an example, there are a few things to note about this formula.
These formulas typically produce decimal numbers which need to be rounded up to the next largest integer. Beyond this, these are each based on a CLT, so the sample sizes must be large enough to justify the CLT in the first place. If they are not large enough, you must increase them. For means, this implies \(n \geq 30\); for proportions, \(np > 5\) and \(n(1-p) > 5\).
They are also based on parameter estimates that you do not typically have. The sample size for means requires an estimate of the standard deviation, \(\sigma\) to get started, and proportions requires the population proportion.
Either the standard deviation or the population proportion could be estimated from a small pilot sample or from prior data. For proportions, there is another option. It turns out that the required sample size is maximized by setting \(p=0.5\), so one could be ‘conservative’ by plugging in \(p=0.5\). Be aware, though, that if the true proportion is close to 0 or 1, the conservative choice of 0.5 can make a huge difference in the results.
Let me demonstrate with an example.
Example: As a frequent flyer, I always felt that airlines underestimated the length of flight delays. It seemed to me that they would typically announce something like a 30 minute delay but the plane would actually be delayed 40 minutes. At a guess, these extended delays would happen nine out of ten times.
I plan to write a nasty letter to the airline. To justify my case, I want to collect sample data on whether delays take longer than they claim or not. How large a sample would I need to get an estimate within 2% of the actual value with 95% confidence? Provide two estimates: one based on my prior belief, the other a conservative estimate.
Solution The problem involves determining the sample size for a proportion. The error bound is \(E=0.02\), \(\alpha = 0.05\), so \(z_{\alpha / 2} = 1.96\). The question also tells us to produce two estimates. A belief-based one which would use \(p=\frac{9}{10} = 0.9\) and a conservative one which would use \(p=0.5\). All that remains is to plug in the values into their respective formulas.
Based on my prior belief we would get: \[n = (\frac{z_{\alpha / 2}\sqrt{p(1-p)}}{E})^2 = (\frac{1.96\sqrt{0.9(1-0.9)}}{0.02})^2 \approx 864.36 \Rightarrow 865\] To be conservative we would use: \[n = (\frac{z_{\alpha / 2}\sqrt{p(1-p)}}{E})^2 = (\frac{1.96\sqrt{0.5(1-0.5)}}{0.02})^2 = 2401\] Being conservative would require a sample nearly 3 times as large.
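These plug-ins are easy to mis-key, so a script double-check is worthwhile. A sketch in Python (standard library only; in Excel this is just NORM.S.INV plus arithmetic):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_prop(p, E, alpha=0.05):
    """n = (z_{alpha/2} * sqrt(p * (1 - p)) / E)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sqrt(p * (1 - p)) / E) ** 2)

n_belief = sample_size_prop(p=0.9, E=0.02)        # belief-based estimate -> 865
n_conservative = sample_size_prop(p=0.5, E=0.02)  # conservative estimate -> 2401
print(n_belief, n_conservative)
```

Both values match the hand calculations above.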
Now you try a problem.
Suppose I have collected a sample of data and estimated the actual delay of planes to have a standard deviation of 20 minutes. How large a sample would I need to estimate the population average delay time within 5 minutes with 98% confidence?
Hint 1: What was the formula for determining sample size? Hint 2: Why do you keep changing the confidence level and what is the \(z_{\alpha / 2}\) with a 98% confidence? Hint 3: What is the answer?
7.4 The t-distribution
There has been an unrealistic element to many of the problems we have been solving lately. We have been assuming we know the population’s standard deviation when it is obvious that we do not. In most cases we could estimate it, but the CLT says we need to know the standard deviation, not an estimate of it – because the estimate would add additional error (or sloppiness as we like to say) to the distribution. Fortunately, that issue is resolved with the t-distribution.
The t-distribution occurs when you would have a standard normal distribution except that you had to estimate the population standard deviation from the sample. The t-distribution, like all distributions, is actually a family of distributions, in this case with a parameter for the degrees of freedom, often noted \(df\), which is based on the sample size: \(df = n - 1\).
The t-distribution looks like the standard normal except it is a bit shorter and wider. The smaller the sample, the lower the degrees of freedom, the shorter and wider the distribution looks. As the sample size gets larger, the distribution looks increasingly like the standard normal. In the limit, as the sample size approaches infinity, the t-distribution converges to the standard normal. In practice, beyond \(df > 500\) the two are essentially indistinguishable.
To use the t-distribution in Excel, you should use the functions T.DIST and T.INV, which play the same roles as NORM.S.DIST and NORM.S.INV respectively.
Figure 7.3: Excel Parameters for T.DIST
Figure 7.4: Excel Parameters for T.INV
The only real difference between these functions is that they require that the degrees of freedom be entered in the calculation.
7.4.1 What changes with the t-distribution?
Now that we have the t-distribution, we can update our z-score function for means to: \[t = \frac{\bar{x} - \mu_\bar{x}}{S_\bar{x}} = \frac{\bar{x} - \mu_x}{\frac{S_x}{\sqrt{n}}}\] The confidence interval for means becomes: \[\bar{x} \pm t_{\alpha / 2, df} S_{\bar{x}}\] or \[\bar{x} \pm t_{\alpha / 2, df} \frac{S_{x}}{\sqrt{n}}\] where \(df = n - 1\). For technical reasons we will not get into here, the formulas for sample size determination and the calculations for proportions do not change.
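In practice you would get \(t_{\alpha / 2, df}\) from Excel's T.INV (or qt in R). Purely to demystify the lookup, here is a self-contained numerical sketch in Python that recovers the critical value by integrating the t density and inverting by bisection. This is our illustration of what the lookup means, not how the software actually computes it:

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) via Simpson's rule on [0, |x|] plus symmetry around 0."""
    b = abs(x)
    if b == 0:
        return 0.5
    h = b / steps
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = s * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_inv(p, df):
    """Quantile by bisection (intended for p > 0.5, i.e. right-tail critical values)."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_crit = t_inv(0.95, df=35)  # alpha/2 = 0.05 in each tail with df = 35, approx 1.690
print(round(t_crit, 3))
```

For \(df = 35\) this returns about 1.690, slightly larger than the normal value \(z_{0.05} \approx 1.645\); that gap is exactly what makes t-based intervals a little wider.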
Now you try a problem.
A sample of 36 customers is randomly selected and asked their age and gender. The average age is 28.32 years and 15 of them are male. Assume the standard deviation of age is estimated to be 2.3 years. What is the 90% confidence interval for the population age?
Hint 1: What is the formula? Hint 2: How do I calculate \(t_{\alpha / 2}\) and the degrees of freedom? Hint 3: What is the answer?
So we are 90% confident that the true mean is between 27.67 and 28.97 years of age.
You can see that moving to the t-distribution changes very little about how to approach the problem. The answer does result in a slightly wider confidence interval. This is because of the additional uncertainty introduced by estimating rather than knowing the standard deviation of the population. Still, the impact, even with a small sample, is very minor.
7.5 Managerial Discussion
The CLT and t-distribution allow us to make incredible progress towards making the tools we have been developing applicable for real world problems. Jointly they have shown that we can use the normal and related distributions, leveraging characteristics we have estimated from the data itself, to make realistic probability statements about many real world processes. This is obviously quite powerful and will eventually form the basis of hypothesis testing, forecasting, simulation, and related fields.
Beyond this, the material in this chapter is quite foundational for a number of aspects of analytics and AI work. While these are some immediately useful applications, the principles involved are important beyond their obvious application. The logic of confidence intervals and sample size requirements are key to understanding what analytics and AI tools can and cannot do. You cannot be an informed consumer of these tools without at least a basic understanding of these issues.
Generally, you will find the ability to model is limited by three factors: the number of observations you have (\(n\)), the scope of the variables you have for each observation (\(k\)), and the non-sample data you can bring to the modelling process. We will discuss the importance of these factors later in the text. For now, we will note that the logic of sample size determination shows that potentially a lot of data may be required to develop precise estimates of certain values, particularly where probabilities / proportions are concerned. While additional scope and non-sample information can help, this problem really doesn’t go away with more sophisticated tools because it is an issue of the information content of the data.
Of particular note is that these formulas, with a bit of effort, show that more data is always useful, but its usefulness grows at an ever-decreasing rate. At some point, your ability to make predictions or estimate the impact of causal factors is limited by the samples you have. And your samples may be limited by cost, time or even the stability of the data generating process.
Confidence intervals are an exceptionally important and under-appreciated aspect of analytics & AI. Organizations are full of predictions, forecasts and other forms of estimates. It is tempting to use these forecasts to provide insight in support of decision making, but any forecast without a measure of confidence, such as a confidence interval, does not contain enough information to be worth acting on.
Finally, as we will see in upcoming chapters, this material collectively forms the basis for hypothesis testing. You will find that hypothesis testing provides a clear and logical way to link data and business acumen to real-world decision making.
Chapter 8 Probability Distribution Practice Problems
Below are some sample problems to help you practice what we have learned about probabilities, probability distributions, and the tools used to calculate probability. In general, they get harder as we progress. Sample solutions are available as part of the Excel companion.
8.1 Songs
I have a list of 10 songs stored on my phone; in shuffle mode it plays songs completely at random. I only like three of the songs (it is Tracy’s playlist!). If I play 5 songs, what is the probability that at least four of them are songs I like?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.2 Boots
My wife has been admiring a particular pair of boots on Amazon.ca. She tells me that the price changes every day but that she thinks it is normally distributed with a mean of $300 and a standard deviation of 30. Apparently the price this morning was $240. She asked me to calculate the probability that she will see a lower price in the next five days. What should I tell her?
Hint 1: How should I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.3 Deck Staining
I wanted to stain my deck the other day. Apparently one should only do this if you are confident that it will not rain for 24 hours after applying the stain. The weather forecast said there would be a 0% chance of precipitation for each of the first 20 hours and a 20% chance for each hour for the final four hours in that 24 hour period. The weather forecast is leaving out some important information regarding the events “rain in hour 21”, “rain in hour 22”, etc. Specifically, it does not say whether the events are mutually exclusive, have independent probabilities or have some other form of conditional probability.
8.3.0.1 a. Explain, in language a manager would understand, what it would mean if:
1. Raining in each of the last four hours were mutually exclusive events.
2. Raining in each of the last four hours were independent events.
3. Raining in each of the last four hours were neither mutually exclusive nor independent events.
8.3.0.2 b. I decided to go ahead and stain the deck. Assuming that rain in each of the last four hours is an independent event, what is the probability that my newly-stained deck will get rained on?
Hint 1: How do I get started? Hint 2: What are the events?
Hint 3: How do I calculate the probability?
8.4 Consultant Travel
As a manager in a consulting company, I have to fly into the client’s site from Austin every week. Two of my consultants have to fly in as well, one from Boston, the other from Chicago. All three flights are scheduled to arrive at 10:00 AM. The flight from Austin is late 5% of the time; the flights from Boston and Chicago are late about 10% and 20% of the time, respectively. Assume that the arrival times are independent.
8.4.0.1 a. Assuming the consultant from Chicago is on time, what is the probability that she will be waiting for exactly one person?
Hint 1: The question says the Chicago flight is on time; how does this change the probability? Hint 2: What formula do I need to calculate the probability?
8.4.0.2 b. What is the probability that the consultant from Boston will be late for at least 4 of the next 8 arrivals?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I find the probability in Excel?
8.4.0.3 c. Is the assumption of independent arrival times a reasonable one? Explain why / why not.
Hint 1: What factors should I consider?
8.5 Home Renovations
This actually happened: two workers from unrelated companies are supposed to show up at my house sometime this morning: an appliance repair person and an electrician. The appliance repair person claims she will arrive between 8:00 AM and 12:00 noon. The electrician claims he will arrive between 10:00 AM and 12:00 noon. Other than this information, I have absolutely no idea when they will arrive, all I know is that I have a phone call from 10:00 to 11:00 AM and it is currently 7:30 AM.
8.5.0.1 a. As described in the story above, are the events of each person’s arrival independent, mutually exclusive or somehow dependent? Explain your reasoning.
8.5.0.2 b. What is the probability that the appliance repair person will arrive before my phone call starts?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.5.0.3 c. What is the probability that both of them will arrive while I am on the phone?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.5.0.4 d. Suppose that as of 11:00 AM, neither have arrived, what is the probability that at least one will arrive in the next 10 min?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.6 Photo Radar
The government of Ontario decided it would try to raise money by installing photo radar cameras on the 401 highway across the top of Toronto. The camera would be set to record the speed of those travelling more than 135 km/h and send them a ticket for $200. It would only run the program between the hours of 6 and 7 PM, during which time it would use radar on exactly 2,000 cars, of which about 1% would be ticketed.
8.6.0.1 a. What is the probability that more than 25 cars get ticketed?
Hint 1: How do I get started? Hint 2: What is the event?
Hint 3: How do I calculate the probability?
8.6.0.2 b. What is the expected revenue of the government, assuming all tickets get paid?
Hint 1: How do I get started?
8.6.0.3 c. If they extended the program until 9 PM, how much revenue would you anticipate the government would make, again, assuming all tickets get paid?
Hint 1: How do I get started?
8.6.0.4 d. What assumptions did you have to make to solve the problem? How realistic are these assumptions? What in the ‘real world’ could cause these assumptions to be faulty and how could you address this in a real consulting engagement?
Chapter 9 Hypothesis Testing
Hypothesis testing offers a structured way to inform business decisions using data. Our approach focuses on the conceptual and business aspects of the process at the expense of the technical aspects. This appears to be at odds with most texts on statistics which focus on technical details. We chose our approach for two reasons that come down to the relative strengths of software and humans.
Statistics is a computer-based activity and, generally speaking, software handles almost all of the technical aspects. It normally does this in an opaque way based on the choice of models and parameters. Once the macro decisions are made by the analyst, software implements a variety of detailed calculations that are largely invisible to the user but are important to get the right answer.
On the other hand, software has no idea what the business context of a particular problem is, nor can it say if a particular test is the correct one. So the analyst’s job is to construct and ask the right questions, understand the conceptual choices, and make sense of the results. This requires a conceptual understanding of the models but not the ability to produce the detailed calculations. We have tried to focus on those issues here.
Our coverage of hypothesis testing will start with building the logic of why you believe some things and not others, and a discussion of how to formalize a business question. You may think these things are obvious, but years of seeing people consistently make errors here tells us that they are not.
From there we will move on to the logic of testing, making sense of the results, and finally extensions.
Before we get started, there is an additional Excel file used for some of the examples in this chapter. You can download it here: Download Hypothesis_Testing_Examples.xlsx
9.1 Why You Believe Some Things, But Not Others?
This may seem like a strange question, but since hypothesis testing is about establishing or changing a belief, it is quite important.
If you are like most people, there are some things that you are prepared to believe, on the basis of your knowledge, experience, etc., without a specific demonstration of evidence. These are probably either things you take as being likely to be true or things whose truth or falsity is so trivial that you do not bother to question them.
For example, if I claimed that Toronto, because it is a large city, tends to be warmer in the winter than Cobourg, a nearby small town, you might believe that. You might believe it because you know something about the nature of urban heat islands and large cities. Or, you might believe me because you take me to be an honest person and because it really doesn’t matter to you whether or not it is true.
You might find it helpful to think of these as presumptive and de minimis beliefs respectively. Presumptive beliefs are based on what you believe to be true in the absence of further evidence. De minimis beliefs are ones whose truth or falsity is below your level of concern. In either case, you might be prepared to accept a claim, at least tentatively, without evidence.
For things that do matter and are outside of your scope of knowledge, you need to use a different approach. Here you need to assess, in some sense, whether the claim is so improbable that it should be deemed unlikely to be true.
For example, if I claimed to have flipped a standard coin 20 times and correctly predicted the result each time, you would probably not believe me. In fact, if you had previously thought I was an honest person, this claim might make you change your opinion. The reason is that, since you know about binomial distributions, you could calculate the probability of my correctly predicting 20 coin flips as about one in a million. Clearly, it is possible, it is just not likely. So your doubt would be based on the improbability of the event, not its impossibility.
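That ‘one in a million’ is just the multiplication rule for independent events, easy to verify:

```python
# Each fair-coin prediction is right with probability 1/2, independently,
# so predicting all 20 correctly has probability (1/2)^20.
p = 0.5 ** 20
print(p)       # about 9.54e-07
print(1 / p)   # about 1 chance in 1,048,576
```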
These examples may seem too contrived, so here are some business examples:
If I told you that a $10,000 subsidy for electric cars would increase their sales by at least 3%, you might believe me without any evidence because it seems to fit the standard economic models. In other words, you might presume, in the absence of other evidence, that it is true. On the other hand, if I told you it would increase the sales of electric cars by 3500%, you might need some evidence because that number seems improbable.
If I told you that occasionally complimenting coworkers on their performance can improve workplace culture and effectiveness, particularly when a crisis occurs, you might choose to believe me because the claim is so innocuous as to be beneath your level of concern.
9.1.1 What Does This Have To Do With Hypothesis Testing?
Hypothesis testing is about making decisions in the situation where neither presumptive nor de minimis beliefs apply. Hypothesis testing applies in cases where you are not convinced in the absence of data and the issue is sufficiently important for you to care. The process helps you make the decision by structuring the problem in such a way that you can determine the probability of an event and assess how likely it is. If it is sufficiently improbable, you will not believe the claim, and that disbelief points you to the correct course of action.
The details in hypothesis testing then come down to three questions: how do you make business decisions, how do you establish the probability of a particular event, and how unlikely does something have to be before you reject a claim?
We will deal with each of these in turn.
9.2 How Should You Make Business Decisions?
Generally when faced with a business decision, we are attempting to make a tradeoff between ways of achieving one or more goals. These goals could be something like increasing profit, growing revenue, preventing bankruptcy, improving public image, improving customer service, or any other collection of outcomes – the specifics do not matter.
With this recognition, it is clear that most, if not all decisions, can be framed in the form: “Should we do option x?” In this context, the ‘option x’ could be just about anything – or any collection of things. The point is that we are considering an option and we have some reason for considering it.
As professional decision makers we should undertake decisions, and the actions they imply, only if the evidence supports those decisions. In other words, we should require evidence to justify that our action is the correct one.
This decision making approach may seem obvious. In fact, it may seem like the only way to make a decision, but there is another, and wrong, way to think about this. You could choose to do something unless there is evidence to suggest that it is a bad idea.
A couple of examples might help. Consider the following business proposals:
“We are thinking of expanding our range of products if the evidence supports the move.”
This seems sensible. We are proposing a business option and questioning whether or not we should do it. Presumably the next step would be to clarify why we would do it and then assess the evidence in light of that objective. If the evidence supports expanding the range of products, the company should do it.
Now consider:
“We are going to expand our range of products unless we find evidence not to do so.”
This one doesn’t look so good. Presumably, whatever reasons for expanding the product range applied in the previous case apply here as well. The company has also decided that it cannot presumptively conclude the evidence supports expansion, nor is the issue so minor as to be unworthy of thought. Yet, unless it finds evidence to the contrary, it proposes to expand the range of products. This means that the less data the company collects, the more likely it is to expand its product range. At the limit, if it finds no evidence at all – in fact, if it does not even look – it will expand the range of products.
This is a bad way of making business decisions. If it is not obvious, consider that ‘expand our range of products’ could be replaced by literally any other business decision no matter how nonsensical and justified by the same lack of data.
Clearly we need a better approach.
9.3 The Logic of Testing
Hypothesis testing is going to proceed on a very simple 5 step process.
1. We establish a null and alternative hypothesis.
2. We determine a threshold for rejecting the null, called \(\alpha\).
3. We collect a sample of data and determine its mean, \(\bar{x}\).
4. We assume the null hypothesis is true and use that assumption, along with the CLT, to calculate the probability that a result like \(\bar{x}\) would have occurred. This probability is called a p-value.
5. If the p-value \(\leq \alpha\), we reject the null hypothesis and conclude the alternative is true. This implies that we should undertake the action that informed the alternative hypothesis. If the p-value \(> \alpha\), we do not reject the null, we do not embrace the alternative hypothesis, and we do not undertake the action that informed the alternative hypothesis.
Sounds simple enough. Let’s look at each step in detail.
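Before we do, the whole recipe can be previewed in a few lines of code. This sketch (Python, standard library only; every number is invented purely for illustration) runs a one-tailed test of \(H_0: \mu \leq 125\) against \(H_1: \mu > 125\) with the standard deviation treated as known:

```python
from math import sqrt
from statistics import NormalDist

# Steps 1-2: hypotheses and threshold (numbers invented for illustration)
# H0: mu <= 125,  H1: mu > 125
mu0, alpha = 125, 0.05

# Step 3: hypothetical sample results
n, x_bar, sigma = 50, 140, 50

# Step 4: assume H0 is true; by the CLT, x_bar ~ N(mu0, sigma / sqrt(n)),
# so the p-value is the chance of a sample mean at least this large
z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)

# Step 5: compare to alpha
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {p_value <= alpha}")
```

Here the p-value (about 0.017) is below \(\alpha = 0.05\), so we would reject the null and take the action. The sections below unpack each of these steps.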
9.3.1 Establish the Null and Alternative Hypotheses
We do this by first defining the alternative hypothesis using the following business logic: \[H_1:\; <We\; will\; do\; x>\; if\; <the\; data\; shows\; y>\] We start this process in English and then nudge it into mathematics. For example, if we were considering developing a new product, we might say: \[H_1:\; <We\; will\; develop\; the\; new\; product>\; if\] \[<the\; data\; shows\; the\; market\; is\; large\; enough\; to\; recoup\; development\; costs\; within\; 3\; years>\] We would then be tasked with translating the second part into mathematics. This invariably involves business acumen, managerial input and mathematical skill. What it does not involve is a belief about what the market size actually is – though clearly the belief in a large market could motivate this.
Let me illustrate this with a more detailed example.
Our crop sells for $8/kg at the farm gate. The manufacturer claims that $1,000 worth of fertilizer per acre will increase our crop’s yield by 8,000 kg with no additional costs. We suspect the increase in yield will be about 4,000 kg. Intrigued by the option, we ran a pilot test on a random sample of 50 acres last year and collected the data. Should we use the fertilizer on all 5,000 acres of crops this year?
\[H_1:\; <We\; will\; use\; the\; fertilizer>\; if\] \[<\; the\; data\; shows\; that\; using\; the\; fertilizer\; increases\; profits>\]
Now, we nudge from English to math: \[<the\; data\; shows\; that\; using\; the\; fertilizer\; increases\; profits>\] really means: the value of the increased yield is greater than the cost of the fertilizer, or in math: \(Increased\; Yield\; * \$ 8/kg > \$1,000\). Dividing both sides by $8/kg takes this further:
\[Increased\; Yield > 125kg\] \[\mu > 125kg\]
The alternative hypothesis will always be expressed in terms of one or more population parameters. Further, it will never contain an equality sign when dealing with continuous distributions.
I would like to point out that in setting up our alternative hypothesis we did not reference our belief at all, nor did we use the manufacturer’s claim. We cannot stress enough that neither of those things has anything to do with the decision. The decision should be based on the objective, which here is assumed to be increasing profits. That is based on the actual effectiveness of the fertilizer relative to its cost – not our belief, nor the manufacturer’s claim.
Determining the alternative hypothesis can be quite difficult. It can even involve conceptual challenges. For example, the simple business assertion: \[H_1: <We\; should\; open\; a\; store\; at\; location\; X>\; if\] \[<the\; data\; shows\; it\; is\; a\; good\; location>\] might be very hard to translate into mathematics. While in English that may be a sensible criterion, the specific phrase could mean different things to different people. Some might think the location gets enough traffic, others that the real estate price is low enough, others that growth in the future will be high.
In cases like these, starting in English and nudging to mathematics is an excellent approach because it establishes the chain of reasoning in a manner that could be discussed with non-technical people. This way no one is surprised by the resulting question.
Once the alternative hypothesis is established in mathematics, the null is constructed as the complementary set. So for any number in the alternative, in this case 125, we get a complementary null: \[H_1: \mu > 125 \Rightarrow H_0: \mu \leq 125\] \[H_1: \mu < 125 \Rightarrow H_0: \mu \geq 125\] \[H_1: \mu \neq 125 \Rightarrow H_0: \mu = 125\] These are constructed to dichotomize the universe – exactly one of them must be true. This is important for the logic of hypothesis tests.
The resulting tests are called one-tailed tests when the alternative hypothesis has a less than or greater than sign only. They are called two-tailed tests when the alternative hypothesis has a not equal sign (or equivalently both a greater than and less than sign).
Optional: Technical Details: Misunderstandings Regarding the Null and Alternative Hypotheses
9.4 Determine the Rejection Threshold
When we discussed why you believe some things and not others, we argued that you disbelieved things because they are unlikely. If someone claims an event with probability of 0.00000001 has occurred, it is pretty safe to disbelieve it; if someone claims an event with probability of 0.34 has occurred, you do not have much justification to doubt them.
Somewhere between the two extremes we need to establish a cutoff point. In hypothesis testing, that number is called \(\alpha\), and it has an important interpretation tied to the types of mistakes you can make and their consequences. You should select \(\alpha\) to minimize the cost of those mistakes.
Optional: Technical Details: Why \(\alpha\)?
9.4.1 Types of Errors
Any time you make a decision regarding a course of action (in reference to an objective), there are two ways you can be wrong. You could choose to take the action when it is not justified and therefore fail at it; or you could choose not to take the action when you should have and therefore miss the opportunity to have achieved your results. These are known as type I and type II errors, respectively.
Each of these errors has a cost to it, and in spite of what you may read online, it is impossible to say in general which one is more costly. To see why, let's examine them further.
9.4.1.1 Type I Error
A type I error occurs when you reject the null hypothesis even though it is true. Given our alternative hypothesis was constructed with the logic: \(H_1:\; <We\; will\; do\; x>\; if\; <the\; data\; shows\; y>\), to make a mistake implies that we thought the data supported the proposed course of action, so we undertook that action. Sadly, we were mistaken and the action was not justified, so we will fail at that action. Consequently we will experience the cost of failure.
For example, we were considering purchasing fertilizer for our farm. We ran a test and decided that we should purchase the fertilizer. Unfortunately, the fertilizer did not work as well as we had expected; it fails to deliver the benefit we had hoped for, and we experience the cost of failure.
Optional: Technical Details: Cost of Failure
\(\alpha\) is specifically associated with the probability of a type I error. In fact, \(\alpha\) is the probability of making a type I error given that the null is true at the equality bound. So selecting a small \(\alpha\) protects against the cost of failure.
9.4.1.2 Type II Error
A type II error occurs when the null hypothesis is false but we do not reject it. Given our logic for the alternative hypothesis: \(H_1:\; <We\; will\; do\; x>\; if\; <the\; data\; shows\; y>\) we will have concluded that the data did not support the action, we will not have undertaken the action and therefore will have missed an opportunity to achieve our objectives.
The probability of a type II error is called \(\beta\) and is a function of several factors. We can minimize the risk of a type II error by increasing power – the probability of rejecting the null hypothesis given that it is false. \(Power = (1-\beta)\), and it increases with sample size, effect size, reductions in variance, and increases in \(\alpha\), though raising \(\alpha\) also raises the risk of a type I error.
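While this book's worked examples use Excel and R, the relationships above can be made concrete with a short Python/SciPy sketch of the power of a one-sided z-test. The function name and all numbers are illustrative assumptions, not part of the book's examples.

```python
from math import sqrt
from scipy.stats import norm

def power_one_sided(mu0, mu1, sigma, n, alpha):
    """Power of a one-sided z-test of H0: mu <= mu0 vs H1: mu > mu0,
    when the true mean is mu1 (normal approximation)."""
    z_crit = norm.ppf(1 - alpha)              # rejection threshold in z units
    shift = (mu1 - mu0) * sqrt(n) / sigma     # how far the true mean sits from mu0
    return 1 - norm.cdf(z_crit - shift)       # P(reject H0 | mu = mu1)

# Illustrative values: a true effect of half a standard deviation
print(power_one_sided(0, 0.5, 1, n=30, alpha=0.05))   # ~0.86
print(power_one_sided(0, 0.5, 1, n=60, alpha=0.05))   # larger n -> more power
print(power_one_sided(0, 0.5, 1, n=30, alpha=0.10))   # larger alpha -> more power
```

Running the last two lines shows the tradeoffs in the text: doubling the sample size or loosening \(\alpha\) both raise power, but the latter does so at the cost of more type I errors.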
Optional: Technical Details: Opportunity Cost
The types of errors and their consequences are summarized in the following table:
Figure 9.1: Types of Errors & Consequences
9.4.1.3 Collecting Data and Calculating its Mean
In earlier chapters we have discussed data collection and sample size determination. You will typically want to think about these issues when conducting a hypothesis test. They are as important as getting the question right in the first place.
While people typically underappreciate it, any mistakes you make in sampling compound the formal errors of hypothesis testing – possibly fatally so. If your sample does not reflect your population, don't expect your results to apply to the population either. Deviations between the sample and population can occur because of poor sampling techniques, but they can also occur because of changes in the real world – what people are willing to pay for a particular generation of technology changes the moment a new generation is released.
Even if your sample reflects the population, you want to ensure you have a sufficiently large sample size to detect the result you are looking for. You could do this by determining what would constitute the minimum practically significant result and then ensuring your sample size would be large enough to detect that with a high level of confidence.
Sample size is the determinant of power that is most easily controlled by the analyst. Power, and therefore sample size, is also related to conviction – the belief in your result. While it is beyond the scope of this chapter, it turns out that if you think about a hypothesis test as being a piece of evidence that shapes your view of the truth, the greater the power, the more it should influence your belief. With a small sample you may ‘get lucky’ and find the result you expect, but it is not clear that a small sample test should influence your belief very much, again, because it is much easier to just ‘get lucky’ with a small sample.
Lastly, a large sample allows you to detect smaller effects. This can be very important if a small effect can be of practical significance.
Optional: Technical Details: Statistical vs. Practical Significance
Once the sample is collected, calculating the mean is easily done by statistical software. With Excel, you can do this using the function =AVERAGE.
9.4.1.4 Calculate the p-value
This is the point where everything comes together for hypothesis testing. By now we have specified our hypotheses in a way that links to our business problem, decided upon a standard for testing, decided on \(\alpha\), and collected data. All that remains is to determine a probability and make a decision. It follows a wonderful logic that unfolds in three parts.
Part 1
We assume the null hypothesis is true at the equality bound. This tells us where the mean of the distribution is actually located.
Part 2
We use our sample to calculate the mean and sample standard deviation and use them to create a t-statistic using the estimated mean, \(\bar{x}\), and standard deviation, \(s_x\), from the sample, along with the population mean, \(\mu_{H_0}\), assumed under the null hypothesis, in the same formula we have seen before: \[t_{df=n-1} = \frac{\bar{x}-\mu_{H_0}}{\frac{s_x}{\sqrt{n}}}\] As we have seen in previous chapters, this statistic can be used to determine how extreme
the sample mean is relative to the hypothesized mean. The more extreme a value is, the lower the probability of getting a result at least as far away from the hypothesized mean in the direction of the alternative. There are a few possible results, based on the type of alternative hypothesis and whether or not the data is consistent with rejection. Each result is shown below.
The best way to think of it is with the rhyme ‘the direction of rejection’. In order to reject the null hypothesis, the data must be consistent with the alternative hypothesis, which is to say, it must be ‘in the direction of rejection’. If the null is \(H_0: \mu \leq 0\), then the alternative is \(H_1: \mu > 0\). This means that ‘the direction of rejection’ is to the right, so the p-value is calculated to the right and only values that are extreme to the right could lead to rejection as shown below.
Figure 9.2: Reject to the Right
If the calculated t-statistic is on the left, it is not consistent with the alternative hypothesis, the p-value is somewhat greater than 0.5, and one cannot reject. If you think about it, this makes sense: the data was consistent with the null hypothesis, so it doesn't provide any evidence against the null.
If the null is \(H_0: \mu \geq 0\), then the alternative is \(H_1: \mu < 0\). This means that ‘the direction of rejection’ is to the left, so the p-value is calculated to the left and only values that are extreme to the left could lead to rejection as shown below.
Figure 9.3: Reject to the Left
Finally, if the null is of the form \(H_0: \mu = 0\), then the alternative is \(H_1: \mu \neq 0\), and the direction of rejection is to both the left and the right. Here we do something that seems strange, we calculate the p-value as the probability of a value more extreme in one direction, then multiply it by two.
Figure 9.4: Reject to Both Sides
The multiplication by two may seem odd but it captures the idea that if the null claims that \(\mu = 0\), values that are either very large or very small relative to the value assumed by the null are equally surprising and would have provided equivalent evidence against the null. Doubling the p-value accounts for this.
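The three rejection-direction cases can be sketched in code. This is an illustrative Python/SciPy translation of the 'direction of rejection' rule (the book's own examples use Excel and R); the function name and the test-statistic value are invented for the demonstration.

```python
from scipy import stats

def p_value(t_stat, df, alternative):
    """p-value for a one-sample t-test, following the 'direction of rejection'.
    alternative: 'greater' (H1: mu > mu0), 'less' (H1: mu < mu0),
    or 'two-sided' (H1: mu != mu0)."""
    if alternative == "greater":        # reject to the right
        return 1 - stats.t.cdf(t_stat, df)
    if alternative == "less":           # reject to the left
        return stats.t.cdf(t_stat, df)
    # two-sided: take the more extreme tail and double it
    return 2 * min(stats.t.cdf(t_stat, df), 1 - stats.t.cdf(t_stat, df))

print(p_value(2.1, 30, "greater"))    # small: evidence in the direction of rejection
print(p_value(2.1, 30, "less"))       # > 0.5: no evidence in the direction of rejection
print(p_value(2.1, 30, "two-sided"))  # exactly double the smaller tail
```

Note how the same t-statistic gives a small p-value in one direction and a large one in the other, matching the figures above.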
Part 3
Once we calculate the p-value, we can decide whether or not the result was sufficiently improbable for us to doubt that it actually happened as claimed. This is the part where alpha comes in – mostly as a tie breaker.
Suppose that the p-value were 0.000000000001, we could live many lifetimes and never see an event as rare as that. That is like bending over to pick up a winning lottery ticket on the side of the road, only to get struck by a meteor. That kind of thing basically never happens, so we would have to conclude something went wrong. If we work our way back through the logic there are three places to consider:
The last thing we did was to use the t-distribution and the CLT to calculate a p-value. This is clearly not the problem. The t-statistic calculations are done by computers – you could check them, but they will be right. The CLT is not the problem either, because it is a very well-established theorem in statistics. We will have to look further.
Before that, we used a sample to calculate a test statistic. Clearly we could have made a mistake in sampling – but we have already been over this and acknowledged that any sampling mistakes are in addition to whatever other mistakes we could make – so this is not the problem. We will have to look further.
The only other thing we did was make the assumption that \(H_0\) was true. This was only an assumption – one that we may not have believed in the first place. This is the only place left where things could have gone wrong, so either we just observed a one in a gazillion event or the assumption was wrong. Pragmatism suggests that the assumption in \(H_0\) was wrong. Since exactly one of \(H_0\) and \(H_1\) had to be true, and \(H_0\) is false, \(H_1\) must be true.
That being the case, we should take the action implied by the alternative. After all, that’s why we did the test in the first place. So far, so good.
On the other extreme, if the p-value had been 0.47 or 0.81 or something like that, it would seem that we observed a relatively likely event. Those kinds of things happen all the time. There would be no chain of reasoning that would make us question the assumption. We would not have evidence to reject it and certainly not evidence to embrace the alternative.
As practitioners of analytics and AI, we should not take action that the data does not support, so we should not take the action suggested by \(H_1\). Again, things are looking good.
What about the in-between area? How do we adjudicate whether the result was sufficiently unlikely? That is where alpha comes in. We use the rule: if the p-value \(\leq \alpha\), reject \(H_0\) and embrace \(H_1\); otherwise do not reject \(H_0\) and do not embrace \(H_1\).
And the cool part is most of this is handled by a computer. Since software varies considerably, we will finish up this section with the output from an Excel calculation.
Example Question
Following our process, the alternative hypothesis is: \[H_1: <I\; will\; conclude\; January\; was\; colder\; than\; normal>\; if\] \[<the\; data\; shows\; that\; January\; is\; colder\; than\; normal>\] \[H_1: <Average\; Daily\; Temperature\; is\; below\; -5\; degrees\; Celsius>\] \[H_1:\mu < -5\]
The ‘Kingston in January’ tab from the Hypothesis Testing Examples Excel sheet has the data we need to calculate the t-statistic.
We can use the t-statistic equation: \[t_{df=n-1} = \frac{\bar{x}-\mu_{H_0}}{\frac{s_x}{\sqrt{n}}} = \frac{-8.4774...-(-5)}{\frac{6.29739}{\sqrt{31}}} = -3.0745\] And then plug our result into the T.DIST function: =T.DIST(-3.0745192,30,TRUE)\(=0.002232\).
Using \(\alpha=0.05\) we find that the p-value is less than alpha; therefore we reject the null hypothesis and conclude that January was colder than normal. We have also completed these steps in the Excel sheet referenced above.
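The same calculation can be reproduced outside Excel. Here is an illustrative Python/SciPy check of the numbers above, using the sample statistics quoted in the example.

```python
from math import sqrt
from scipy import stats

# Sample statistics from the 'Kingston in January' example
x_bar, s_x, n = -8.4774, 6.29739, 31
mu_h0 = -5                 # H0: mu >= -5; H1: mu < -5

t_stat = (x_bar - mu_h0) / (s_x / sqrt(n))
p_value = stats.t.cdf(t_stat, df=n - 1)   # left tail: the 'direction of rejection'

print(t_stat)    # ~ -3.0745, matching the hand calculation
print(p_value)   # ~ 0.00223, matching Excel's =T.DIST(-3.0745192,30,TRUE)
```

Since 0.00223 < 0.05, the conclusion is the same as in the Excel workflow: reject the null.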
Having studied tests of a single parameter, we will now extend them to two parameters.
9.7 Summary: Two Parameter Testing Tree
Here is a concise summary of how to approach testing with two parameters. In this context, independent means that there is no common source of variation in the data.
Figure 9.9: Testing Tree
Most software will allow you to run a specific test for equal variance (though Excel does not seem to do this!)
If your software does not do this, or you do not know how to do it, you are a bit safer using weighted rather than pooled tests. Weighted tests are more robust because they do not assume that the two populations have equal variance. This robustness has a minor cost: pooled tests are a bit more powerful when the equal-variance assumption holds.
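In Python's SciPy, for instance, the choice between the two tests is a single flag on `ttest_ind`: `equal_var=False` gives the weighted test (often called Welch's test), `equal_var=True` gives the pooled test. The data below is made up for the illustration.

```python
from scipy import stats

# Illustrative data: two independent samples with clearly different means
group_a = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10]
group_b = [14, 15, 13, 16, 14, 15, 13, 14, 16, 15]

pooled = stats.ttest_ind(group_a, group_b, equal_var=True)   # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # weighted: no such assumption

print(pooled.pvalue, welch.pvalue)  # both tiny here; they diverge more when variances differ
```

With samples this well separated, both tests reject decisively; the distinction matters most when sample sizes and variances differ across groups.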
Optional: Technical Details: Statistical vs. Practical Significance
Chapter 10 Hypothesis Testing Practice Problems
Below are some sample problems to help you practice what we have learned about hypothesis testing. In general, they get harder as we progress. Sample solutions are available as part of the “Stats for AI Excel Companion” file.
10.1 Ice Cream
The ice cream data contains sales of chocolate and vanilla ice cream in Kingston. I have always thought that vanilla sells better than chocolate and I am prepared to accept this belief in the absence of data. If they are not sold at the same rate, I might need to change my ordering policy.
10.1.0.1 a. State the appropriate null and alternative hypothesis for the problem
Hint 1: How do I construct the hypotheses?
10.1.0.2 b. Use the data to test the hypothesis assuming the data are independent. Report the p-value and interpret the results. Did you have to make any assumptions here?
Hint 1: What kind of test do I use?
10.1.0.3 c. Use the data to test the hypothesis assuming the data are matched in pairs by day. Report the p-value and interpret the results. Did you have to make any assumptions here?
Hint 1: What kind of test do I use?
10.1.0.4 d. Explain in language a manager is likely to understand: 1) When a paired t-test is appropriate. 2) Why a paired t-test is a more powerful test when it is appropriate.
10.2 Electric Cars
I am thinking about buying an electric car. From what I have read, the big risk with electric cars appears to be how far one can go on a single battery charge. Apparently this depends on characteristics such as weather, amount of traffic, speed of travel, etc. So I am concerned that the distance I can travel on a single charge may not be far enough. I would be comfortable if the average travel distance was greater than 280 km. I found a sample of 49 unrelated trips for the type of car I propose to buy.
10.2.0.1 a. State the appropriate null and alternative hypothesis for this problem
Hint 1: How should I construct the hypotheses?
10.2.0.2 b. Explain how I might best think about choosing an alpha for this test. Be sure you indicate what it means.
Hint 1: How do I think about alpha?
10.2.0.3 c. Perform the test based on the available data. Calculate and report the p-value for the test and interpret the results using alpha = 5%. What can you conclude?
Hint 1: What test do I use? Hint 2: What values do I need? Hint 3: How do I calculate the probability?
10.2.0.4 d. Irrespective of what you found, assuming this test yielded a very sound rejection, what would that tell me about the probability that I can expect to travel more than 280 km on any given trip?
10.3 Hold Your Breath
Apparently the average person can, with training, hold their breath for 2 minutes. (Frankly, I don’t believe it, but that’s what was claimed on a random page I found on Google!)
10.3.0.1 a. Suppose that the mean is 120 seconds and the standard deviation of this is 20 seconds. If we had 100 people attempt to hold their breath as long as they could, how likely is it that the average of that sample would be longer than 140 seconds?
Hint 1: How do I get started? Hint 2: What values do I need? Hint 3: How do I calculate this in Excel?
10.3.0.2 b. Consider the data provided – what do you think of the claim made that people can hold their breath for two minutes: construct an appropriate null and alternative hypothesis and perform the relevant test? Explain any decisions you had to make and the result of your analysis.
Hint 1: How should I construct the hypotheses? Hint 2: What test should I use? Hint 3: What values do I need? Hint 4: How do I calculate the probability?
10.4 Greener By Design (Part 1)
Greener by Design has been working in Montreal performing environmental audits for several years. Lately the owner has been concerned that the sales are falling and in particular that average daily revenue from audits in July are lower than those in June. If sales are falling, she plans to abandon the market and will stop investing in further advertising.
10.4.0.1 a. State the appropriate null and alternative hypothesis for this problem.
Hint 1: How do I structure the hypotheses?
10.4.0.2 b. Perform an appropriate test, present your p-value and interpret the results based on an alpha of 10%.
Hint 1: What test should I use?
10.4.0.3 c. Based on these results, should she abandon the Montreal market? Provide a reasonably comprehensive answer.
10.5 Greener By Design (Part 2)
Greener by Design has been working in Vancouver and Toronto performing environmental audits for several years. The owner thinks that, given her staff, the two markets should be equally profitable, but fears the Vancouver office may be underperforming and require a change in management. She has been friends with the current manager for years and would prefer to keep him, but business is business (isn’t it?). Using the data provided in the Excel file, advise her.
10.5.0.1 a. State the appropriate null and alternative hypothesis for this problem.
Hint 1: How should I construct the hypotheses?
10.5.0.2 b. Perform an appropriate test, present your p-value and a choice of alpha.
Hint 1: What test should I use? Hint 2: How should I set alpha?
10.5.0.3 c. Based on these results, should she change the management in the Vancouver market? Provide a reasonably comprehensive answer.
Chapter 11 Introduction to Modeling
11.1 Introduction to Regression (Part 1)
Linear regression modelling is one of the first truly powerful analytics models that a data scientist encounters. These models encompass and extend many of the statistical techniques one encounters in an introductory class in statistics. In addition to being powerful in their own right, regression models introduce the structure for related models in that they disentangle how multiple independent variables combine to influence an outcome of interest and simultaneously measure the separate contributions of those independent effects. This approach is used in other models, such as discrete choice modelling (Probit, Logit, etc.), models for duration (hazard models), models for counts (Poisson regression), and models for truncated and censored data (Tobit, the truncated regression model, and others).
The general linear regression model combines a series of parameters (\(\beta_0, \beta_1, ..., \beta_K\)) along with the explanatory variables (\(X_1, X_2, ..., X_K\)) into a linear equation of the form:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]
The error term belongs in every regression model; it captures the combined effect of all of the factors omitted from the equation.
A particular form might look like this:
\[Earnings = \beta_0 + \beta_1 Years\_Experience + \beta_2 Years\_Education + Error\; Term\]
The regression method provides an estimate for each of the parameters in the model. It also provides a standard way of testing the parameters, along with diagnostic tools that provide additional information about the fit of variables in the model.
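To illustrate what "providing an estimate for each parameter" looks like, the sketch below fits a linear regression by ordinary least squares on simulated data using only NumPy. The data-generating values (2, 3, and -1) are invented for the example; the book's own examples use Excel and R.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulate data from a known model: Y = 2 + 3*X1 - 1*X2 + error
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 3 * x1 - 1 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept (beta_0)
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # roughly [2, 3, -1]: the estimates recover the true parameters
```

Because we simulated the data ourselves, we can check that the estimated parameters land close to the values that generated it, which is exactly what a good estimator should do.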
Regression is a really awesome tool!
But, before getting too far into the details, we should acknowledge that there are at least two broadly different views on the role of regression modelling in analytics. Both views are legitimate and lead to useful application of regression modelling, and both are mathematically justified, but they do result in different applications of the tool and lead to different kinds of actions.
Technical Detail
Regression models have both mathematical and statistical properties. The mathematical properties flow from the matrix algebra that defines the model and hold irrespective of the characteristics of the data. The statistical properties follow from the characteristics of the data and the model’s fit with the data generating process (DGP) that creates the data. As we will discuss later, the statistical properties only apply if certain assumptions about the data hold. When they are valid, these statistical properties allow us to develop deeper insights about the problem we are analyzing. The econometric modelling techniques tend to focus a great deal of effort on validating the statistical properties of the models; the black box techniques tend to rely on the mathematical properties only.
At risk of trivializing both methods and emphasizing differences that may be more conceptual than real, we will offer a quick overview of the two.
11.2 Econometric vs Black Box Regression
11.2.1 The Econometric Approach
The econometric view of regression, also known as causal modelling, sees the model development as an interaction between theory as expressed through mathematics and data through the statistics of regression. The mathematical model involved is thought of as describing a simplified version of the process where the variables ‘cause’ the behavior of the dependent variable. For example, a modeler may start with a ‘theory’ that sales are primarily driven by prices and advertising budgets amongst other factors. Based on this theory, she might build a simple model:
\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + Other\; Factors\; (i.e., \;the\; error\; term)\]
to reflect her thinking that as advertising budgets increase, sales will also go up; and that as prices increase, sales will go down. With the mathematical model developed, the data is then applied to find the best fit for the parameters (\(\beta_0\), \(\beta_1\), and \(\beta_2\)), along with reams of diagnostic information on the quality and fit of the model. The modeler uses these results to develop insights and to refine the model to reflect her improved understanding of the causal relationship involved. Perhaps her analysis suggests that one of the important other factors is the competitor's price. She might modify her model to be:
\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + \beta_3 Competitor\_Price + Other\; Factors\]
This process can be thought of as a very hands-on process that is not easily automated. It involves an analyst using many tools and techniques to improve her understanding of the world. Often this involves looking at graphical output, running specification tests, creating additional variables, and other techniques.
For reasons that will become clear later, the goal will be to find the model that best fits the data, with as much of the variation in the dependent variable as possible being explained by the explanatory variables that truly belong in the model. Ideally, very little will be left in the category of other factors / residual / error. A good econometric model should have, amongst other things, a collection of explanatory variables that explain a large amount of the variation, a credible justification for each variable in the model, and a small collection of other factors.
11.2.2 The Black Box Approach
The black box method takes a more hands-off approach with respect to theory. Aside from setting the content of the original data sets, these techniques (e.g., ridge regression, lasso regression) are driven by algorithms that are highly automated and involve little or no human oversight.
Due to their algorithmic nature, these techniques do not leverage a theory of causation. Instead, they focus on managing the tradeoff between the number of explanatory variables used in the model and the amount of variation explained by the model. Since these are data-intensive methods, they tend to be concerned with the possibility of overfitting the models – which is to say, building a model that fits a particular data set but is not generalizable. Black box methods can be exceptionally useful when there is a lot of data and the goal is to predict an outcome. They are also useful in situations where there is little in the way of theory to explain the data; where the variables that theory would require are not available; or where the process must be automated due to resource constraints or turnaround requirements.
Generally speaking, a good black box model is one where a small collection of explanatory variables reliably predict the dependent variable. In particular, the model should be able to predict results based on samples that were not used to develop the model in the first place.
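To give a flavor of one such technique, here is a minimal NumPy sketch of ridge regression via its closed form. The penalty value and data are illustrative, and for simplicity this sketch penalizes all coefficients; real libraries handle intercepts and variable scaling more carefully.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 5
X = rng.normal(size=(n, k))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

lam = 10.0  # penalty strength: larger values shrink coefficients harder

# OLS: minimize ||y - Xb||^2
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: minimize ||y - Xb||^2 + lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Ridge trades a little bias for lower variance: coefficients shrink toward 0
print(np.linalg.norm(beta_ridge), "<", np.linalg.norm(beta_ols))
```

The shrinkage toward zero is the mechanism by which ridge manages the tradeoff described above: it discourages the model from leaning too hard on any one variable, which helps generalization at the cost of some in-sample fit.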
11.2.3 Which Approach is Better?
If you are looking for a clear winner here, you are going to be disappointed. Each method has its strengths and weaknesses.
Relative to the black box method, the econometric method is obviously more labor intensive, but it also brings non-sample information to the process in the form of a theory. It also allows the model to be adjusted to reflect the business acumen of the modeler. Perhaps more importantly, if the theory is correct – and that is a mighty big if – then the estimates that come from the econometric method have a credible claim of measuring causal effects. If so, then they can be used for prescriptive analytics, which is a huge advantage over black box methods, which can only be used for predictive analytics. While that may sound like a clear win for the econometric method, there are cases where black box techniques easily outperform the econometric technique, particularly in prediction. In part this is true because regression is only one of many techniques available to the black box approach; others include decision trees, neural networks, and time series analysis.
Even within the scope of regression models, the black box approach can still beat the econometric approach in developing predictions. The black box method can consider thousands of models in the time it takes the econometric method to develop a single model. If there is little value from external theory or the data required by such a theory is not available, but a large quantity of potentially relevant data is available, the speed of the black box method may find a better model than an econometrician could, and do it much faster.
As if to prove this point, in at least one case econometricians themselves developed black box forecasting techniques due to the difficulties associated with traditional econometric prediction. The field known as time series analysis and models such as ARIMA were developed because of the failure of econometric models to provide reliable forecasts using traditional causal modelling.
One conclusion might be that the econometric approach should be used when an insight about the causal process is required to inform rather than automate a decision; when non-sample insights about the data are going to be important in shaping the analysis; and when prescriptive analytics are required. Black box techniques should be used when automated predictions are required and when massive amounts of potentially relevant data are available.
As a data scientist, you should understand, be able to use, and explain both approaches. That said, the econometric approach has a more restrictive set of assumptions because of its use of the statistical properties of regression models, its focus on human-intermediated processing of data, and its aspiration to reflect causation. For the balance of the book, we will focus on the econometric approach, recognizing that much of what we say will also help prepare you for black box analysis.
11.3 Introduction to Regression (Part 2)
Most textbooks introduce linear regression with the simple linear regression model. This is a model with a single independent variable. Its general form is: \[Y = \beta_0 + \beta_1 X_1 + Error\]
A specific model might look like this:
\[Sales = \beta_0 + \beta_1 Ad\_Budget + Error\]
This model is almost never used in practice because most things a data scientist would like to model are explained by more than one factor. The benefit of the simple linear regression model is that it is easy to show graphically. Plotting the fitted regression line against the data produces a chart like this.
Figure 11.1: A Simple Linear Regression Model
The general linear regression model, as described above, has the form:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]
A particular form might look like this:
\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Price + Error\]
Naturally, this would be a fair bit more difficult to show in a graph, since it is at least three-dimensional. There is a nice trick, however, which is to focus on a single variable, while holding the others constant at some value like 0, which preserves the interpretation of \(\beta_0\), or their mean, which makes for much more sensible depictions of the results. By doing this, we can depict the relationship between one explanatory variable and the dependent variable.
Figure 11.2: A Linear Regression Model, All Other Variables Held Constant at 0
Since the multiple regression model can be depicted quite easily using either of these techniques, we will use it throughout the rest of the book, holding other variables constant at zero for simplicity and recognizing that our readers can make this adjustment for themselves.
11.3.1 What does Linear Regression really mean?
Given that we have a general linear regression model, you might be wondering what it actually means. From your previous studies, you should be familiar with the idea that, for any variable you could consider, the value of that variable is the average of the class to which you belong + your individual deviation from the average.
This is pretty abstract, so let’s use a specific example. Now, it is clear that we don’t know you, but we do know that your earnings last year are described by the following equation:
\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]
This is really all that a regression equation is doing. When we have the general linear regression model:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_K X_K + Error\]
The x-variables define the class to which the observation belongs. Sometimes the x-variables are discrete variables like \(X_1\) = 1 if employed; 0 otherwise. More often they are continuous, like \(X_2\) = number of hours worked, and occasionally they are interactions like \(X_3\) = yrs_education * employed. The individual variation is the prediction error, or as we suggested above, the other factors that have been omitted from the model.
In short, a linear regression model simply provides the conditional average of a specific dependent variable, \(Y\), based on a collection of one or more independent variables, \(X_1 ... X_K\), that can be continuous, discrete or some combination of the two.
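The 'conditional average' interpretation can be checked directly: regressing \(Y\) on an intercept and a single 0/1 dummy reproduces the two group means exactly. A small NumPy sketch with made-up numbers:

```python
import numpy as np

# Made-up earnings for two 'classes': employed (1) and not employed (0)
employed = np.array([0, 0, 0, 1, 1, 1, 1])
earnings = np.array([20.0, 25.0, 30.0, 50.0, 55.0, 60.0, 75.0])

# Regress earnings on an intercept and the employment dummy
X = np.column_stack([np.ones(len(employed)), employed])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)

# beta_0 equals the mean of the employed == 0 group;
# beta_0 + beta_1 equals the mean of the employed == 1 group
print(beta[0], earnings[employed == 0].mean())
print(beta[0] + beta[1], earnings[employed == 1].mean())
```

The coefficients are not approximations here: with a single dummy, OLS reproduces the group averages exactly, which is the conditional-average idea in its purest form.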
11.3.2 Modelling with Linear Regression
The linear regression model is limited by being linear, but linear in this context is not as restrictive as you might suspect. To see why, let’s consider a few things you can do with a linear regression model.
11.3.2.1 Modelling Different Groups
Linear regression allows you to model distinct groups that have similar responses to some variables but different responses to others. Suppose your client, a restaurant manager, suspects that his customers spend more at the restaurant the higher their incomes are. This would suggest a model:
\[Sales = \beta_0 + \beta_1 Income\]
But, he might also suspect that male customers spend more than others. One way to model this would be to assume that male customers spend the same amount more at any given level of income. This would suggest a model like:
\[Sales = \beta_0 + \beta_M Male + \beta_1 Income\]
Where observations belonging to a male respondent are marked by a dummy variable: Male = 1 if the customer is male; 0 otherwise.
Figure 11.3: Linear Regression with a Dummy Variable
You can confirm that the regression model produces these two distinct relationships by considering the set of male customers vs. others. If Male = 0, the customer is not male and the model becomes:
\[Sales = \beta_0 + \beta_1 Income\]
If the customer is male, then Male = 1 and the model becomes:
\[Sales = \beta_0 + \beta_M(1) + \beta_1 Income\]
\[=(\beta_0 + \beta_M) + \beta_1 Income\]
Confirming that we can have two distinct ‘curves’ in one linear regression model and that the Male line is exactly \(\beta_M\) units above the Not Male line.
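A quick way to convince yourself is to simulate data with a known gap and fit the model. This is only a sketch; the coefficient values (5, 8, 0.4) and the income units are invented for the illustration.

```r
set.seed(1)
n <- 200
income <- runif(n, 20, 100)     # income (invented units)
male   <- rbinom(n, 1, 0.5)     # dummy: 1 if male, 0 otherwise
sales  <- 5 + 8 * male + 0.4 * income + rnorm(n, sd = 3)

fit <- lm(sales ~ male + income)
coef(fit)    # estimates should land near 5, 8 and 0.4

# At any given income, the two fitted lines differ by exactly beta_M:
newdata <- data.frame(male = c(0, 1), income = c(50, 50))
diff(predict(fit, newdata))     # equals coef(fit)["male"]
```
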
Two things are worth noting here. In this model the ‘base case’ is the not-male case. The impact of being male is measured relative to this base case. This may seem trivial, particularly if your data set contains only male and female respondents, but it becomes important when we deal with more complex situations. For example, we might use a similar approach to predict the likelihood of defaulting on a loan payment based on the city someone lives in. In this case, we might have 15 different cities across Canada, each with their own categorical dummy variable, and we would have to think carefully about what the base case was. We would also have to think carefully about how to interpret the results.
The second thing is that we have drawn this as though the male curve was above the other curve. In actual fact what we have really done is permitted the model to reflect a difference between male customers and others. The data will decide whether that difference is real and whether it is positive or negative. This illustrates a general principle about modelling and ultimately testing: to test a hypothesis, we develop a model that allows the data to express a characteristic and then test to see if it actually does. Here we SUSPECT that males consume more, so we build a model that allows the data to express males consuming more and then see what the results tell us. Perhaps males consume less, or perhaps there is no difference. We will return to this issue when we discuss testing.
Returning to our example, perhaps upon seeing the output from this model, the client revises his thinking – he now believes that what really happens is that male customers spend a greater share of their income on food in the restaurant, rather than simply spending the same amount more at any given level of income. This could be captured by a different model:
\[Sales = \beta_0 + \beta_1 Income + \beta_M Male * Income\]
This model would produce the following relationship:
Figure 11.4: Linear Regression with an Interaction Dummy Variable
By considering the set of male customers vs. others, you can confirm that this single model produces two sloped lines with the same intercept. As before, this model might actually produce a result where sales to the male set of the population are less responsive to income than those to the base case. If that were the case, \(\beta_M\) would be a negative number.
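As a sketch (simulated data, invented coefficients), the interaction model can be fit in R by multiplying the dummy into the income term; the two groups then share an intercept but have different slopes.

```r
set.seed(2)
n <- 200
income <- runif(n, 20, 100)
male   <- rbinom(n, 1, 0.5)
# Same intercept for both groups; males' slope is steeper by 0.15:
sales  <- 10 + 0.30 * income + 0.15 * male * income + rnorm(n, sd = 3)

fit <- lm(sales ~ income + male:income)
coef(fit)
# The income coefficient is the base-case slope; adding the interaction
# coefficient gives the slope for male customers.
```
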
11.3.2.2 Modelling Different Time Periods
In addition to modelling different groups, models can be developed to capture different effects at different time periods due to a structural change in the real-world process. Models like this are used when the nature of the relationship changes at a particular point in time, rather than evolving slowly over time. They can be used to reflect how relationships might change due to tax policy, competitor behavior, technological change, or any other significant event.
For example, your client, an automotive parts supplier, may have been experiencing an increase in demand for electric car components over time and would like a simple model to forecast demand in the future – perhaps to know when to expand capacity. In this case, a model like
\[Sales = \beta_0 + \beta_1 Time\]
might be a good starting point. It certainly has the virtue of simplicity and is easy to develop and use. The relationship captured by the model would look like:
Figure 11.5: Linear Regression over Time
Now suppose that at some point in time \(T*\), the government introduces a subsidy to increase the sales of electric cars. At that point, your client expects that the sales of electric cars will jump up, but then continue growing at the original rate. In other words, your client expects that sales look like this:
Figure 11.6: Linear Regression with a Jump Change at Time T*
Based on the discussion above, how could you build such a model? Try writing out the equations that would do it, then test to see if it actually works. Spend a minute or two on this, then check your results below.
Consider one more structural change model. Suppose your client believed that the structural change would not cause a jump, but instead would cause the rate of growth to increase at time period \(T*\). In other words, the relationship she expects looks like:
Figure 11.7: Linear Regression with a Slope Change at Time T*
Again, try to develop this model, but be warned, it is a fair bit more tricky than the previous one.
Hint 1: It will involve a dummy variable
Hint 2: It will involve a new variable for time.
Hint 3: Solution
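For readers who want to check their attempt, here is one way to sketch both structural-change models in R. The data are simulated and \(T*\) = 60 and all coefficient values are invented; the jump uses a simple dummy, while the slope change uses a variable counting time elapsed since \(T*\).

```r
set.seed(3)
time  <- 1:100
Tstar <- 60
after <- as.numeric(time >= Tstar)   # dummy: 1 from T* onward

# Jump change: the intercept shifts at T*, the slope does not.
sales_jump <- 50 + 2 * time + 30 * after + rnorm(100, sd = 5)
jump_fit   <- lm(sales_jump ~ time + after)
coef(jump_fit)                       # 'after' estimate should be near 30

# Slope change: growth accelerates at T*. The new variable is zero
# before T* and counts periods elapsed after it (the dummy times (t - T*)).
time_since <- after * (time - Tstar)
sales_kink <- 50 + 2 * time + 1.5 * time_since + rnorm(100, sd = 5)
kink_fit   <- lm(sales_kink ~ time + time_since)
coef(kink_fit)
```
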
11.3.2.3 Modelling with Curves
The ‘linear’ part of linear regression suggests that the model can only capture straight lines. This is not the case – by transforming variables, a wide class of nonlinear relationships can be captured by linear regression. There are limits to what can be done, but linear regression can model many ‘non-linear’ relationships including: polynomials, functions of \(\frac{1}{x}\), log linear relationships, and others.
For example, your client may generally believe that sales increase with advertising budget, but that the rate of increase decreases as the budget grows. So, the relationship she envisions is one like that shown below.
Figure 11.8: Linear Regression with Decreasing Returns to Scale
Such a relationship might be captured by either of these two models. The first is a simple rational function, also known as a reciprocal function:
\[Sales = \beta_0 + \frac{\beta_1}{Ad\_Budget +1}\]
where we would expect \(\beta_1\) to be a negative number.
The second is a polynomial function, in this case, the function is a second order polynomial, though higher ordered ones could be used in principle:
\[Sales = \beta_0 + \beta_1 Ad\_Budget + \beta_2 Ad\_Budget^2\]
Where \(\beta_1\) would be positive and \(\beta_2\) would be negative.
Their respective graphs would look like this:
Figure 11.9: Linear Regression with a Reciprocal and Polynomial Curve
The first model’s shape is consistent in that it approaches some upwards limit set by \(\beta_0\), but the slope is quite restrictive and is set by a single term, \(\beta_1\). The second model has more flexibility because it has three estimated parameters, but clearly only applies over a narrow range of observations since sales actually go down as Ad_Budget increases beyond about 20.
In addition to showing how curves can be captured by linear regression, these models illustrate the important point that models are designed to apply to a narrow range of data. Beyond that range, they do not have a reasonable hope of predicting results.
These two models were applied to ranges of advertising budgets from 0 to 30 – it is probably not reasonable to attempt to fit a single model to such a wide range of data. If the models were fit to a narrower range of data, it would tend to fit better because the extreme characteristics – the steep slope at the beginning of the first function and the downward curve at the end of the second, would not appear in the range of observations.
Our purpose here is not to claim that these are the best models to use in any particular setting, rather, it is to demonstrate some of the range of modelling options available in linear regression.
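Both curved specifications can be estimated with an ordinary lm() call by transforming the advertising variable first. The sketch below uses simulated data with invented coefficients; the point is only that the models remain linear in the beta terms.

```r
set.seed(4)
ad_budget <- runif(150, 0, 30)
sales <- 100 - 60 / (ad_budget + 1) + rnorm(150, sd = 2)

# Reciprocal model: transform the variable, then fit a model that is
# still linear in the betas.
recip_fit <- lm(sales ~ I(1 / (ad_budget + 1)))
coef(recip_fit)      # the slope should come out negative

# Second-order polynomial in the budget:
poly_fit <- lm(sales ~ ad_budget + I(ad_budget^2))
coef(poly_fit)
```
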
11.3.3 Regression Assumptions
Now that you are familiar with what regression can do, let’s explore the requirements for the regression model. As we mentioned earlier, linear regression has both mathematical and statistical properties. Most of the assumptions are necessary for the statistical properties to hold, but two are required for the mathematical properties to hold.
11.3.3.1 Mathematical Requirements
1. There must not be a perfect linear relationship among any of the x-variables.
This means that none of the \(x_0 ... x_K\) variables can be expressed as \(x_i = \sum_{j \neq i} a_j x_j\).
You may have noticed we said all \(x\) variables from \(x_0\) to \(x_K\) – even though we typically refer to having \(k\) \(x\) variables. That was not a mistake: the linear regression model assumes that a column of ones is provided as one of the \(x\) variables, and this invisible \(x\) variable is associated with the \(\beta_0\) term.
We mention this because this requirement is typically only violated by dummy variables when one includes a dummy variable for every category in the model. Suppose your data set contains a single gender field that can be coded as male or female. You might be tempted to specify a model similar to the one above that modeled male sales in restaurants, only with both male and female characteristics:
\[Sales = \beta_0 + \beta_M Male+ \beta_F Female + \beta_1 Income\]
With suitably defined dummy variables, Male and Female. The problem here is that, since all observations are coded as male or female, you get:
\[1 = Male + Female\]
Which violates the requirement and will result in an error when the regression is run.
To avoid this, a category must always be omitted when using dummy variables to indicate category membership. That omitted category becomes the base against which all other effects are measured. In the case above, we used Female as the base category and treated Males as deviations from that category.
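You can see the requirement being violated, and how R handles it, in a small sketch (simulated data, invented coefficients): including both Male and Female dummies makes the model rank-deficient, and lm() responds by dropping one of them and reporting NA.

```r
set.seed(5)
n <- 100
male   <- rbinom(n, 1, 0.5)
female <- 1 - male              # perfectly collinear: male + female = 1
income <- runif(n, 20, 100)
sales  <- 5 + 8 * male + 0.4 * income + rnorm(n)

fit <- lm(sales ~ male + female + income)
coef(fit)                       # the redundant 'female' term comes back NA
```
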
The only other time we have seen this emerge in practice is when variables are derived from each other. For example a model might contain, Age, Experience and Education, but somewhere along the line, experience might be calculated by a proxy where \(Experience = Age – Education - 5\), where the 5 represents years before formal education starts.
2. While not strictly a mathematical requirement, some of the nice mathematical features one takes for granted in a regression model will not work if the \(\beta_0\) term is omitted. **We typically recommend retaining the \(\beta_0\) term, even if it is not statistically significant, unless you really know what you are doing and why.**
Beyond the mathematical assumptions, there are statistical assumptions. These assumptions are a bit trickier than the mathematical ones because the model will typically produce results even when the assumptions are violated – the results may be unreliable – but you will get them. A good analyst will want to know what these assumptions are and how to confirm that they apply. Clearly, this list is only meant as a minimal technical starting point.
11.3.3.2 Statistical Requirements
1. The model must be specified correctly. In other words, the relationship between the explanatory variables and the dependent variable must match that of the real data generating process. Naturally, this means that the relationship must be linear in the general sense described above.
This is probably the least appreciated and most important assumption when it comes to prescriptive analytics. It implies that, for prescriptive analytics at least, the model has to be consistent with the true but unknown data generating process.
There are several ways that models can be misspecified. The most common problem is to have omitted variables. This is often a result of simply not having the data. How significant this problem is depends on the relationship between the variables that are missing and the ones that have been included. If the omitted data is correlated with the included data, the effects of the omitted data will be wrongly attributed to the included data. For example, suppose that the productivity of an assembly line worker increases with work experience, but that work experience cannot be easily measured, though years of education are available. One might be tempted to build a model omitting the variable experience and produce a model:
\[Productivity = \beta_0 + \beta_1 Yrs\_Education + Error\]
The model is likely to work out just fine because the omitted variable, experience, is not likely to be strongly correlated with education. There may be some cohort effects – in that levels of education have been going up over time, so there may be a slight negative correlation between the two variables, but for this example let’s assume it is negligible. In that case, the estimated impact of education, such as it is in this model, is likely to be unbiased and the effect of experience would end up in the error term.
Suppose the data scientist doing the analysis realizes that, while experience may be hard to measure, age is not, and age is correlated with experience, particularly if education is also considered. So perhaps the analyst will choose to include Age, not because it theoretically belongs, but because it improves the fit of the model. The resulting model would be:
\[Productivity = \beta_0 + \beta_1 Yrs\_Education + \beta_2 Age + Error\]
In this case, the omitted variable is strongly correlated with one that is included in the model. This means that the impact of age will be distorted. It is likely that age will appear to be associated with higher productivity, even though it is not. And, ironically, including it will improve the fit of the model – and even its in-sample prediction. If this mistake were not detected, and the model were used for recruiting purposes, the managerial implication would be to hire older workers, without regard for their actual experience. This would very likely not produce the productivity gain the client expected.
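This omitted-variable story is easy to demonstrate by simulation. In the sketch below (all numbers invented), age has no effect in the true process, yet because it is correlated with the omitted experience variable, it picks up a large, spurious coefficient.

```r
set.seed(6)
n <- 1000
experience <- runif(n, 0, 30)
age        <- experience + runif(n, 20, 30)   # correlated with experience
education  <- runif(n, 10, 20)                # roughly independent of it

# True process: productivity depends on education and experience only.
productivity <- 2 + 0.5 * education + 0.8 * experience + rnorm(n, sd = 2)

# Omit experience but include its correlate, age:
biased_fit <- lm(productivity ~ education + age)
coef(biased_fit)
# Age 'inherits' most of experience's effect, even though its true
# coefficient is zero -- and the fit improves, exactly as described above.
```
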
Even if the model has the right variables, there can still be problems. The model could have the wrong relationships amongst the variables – such as leaving out differences between subsets of the data, leaving out a structural change, or neglecting nonlinear effects. Conveniently, these are the exact extensions to the linear regression model that we demonstrated above!
As you saw when we addressed these problems, we typically corrected them by adding new variables to the model. In some sense, failing to correct these problems is often a special case of omitting variables. You will encounter other cases as you study more regression and related models.
In summary, for model specification errors, the good news is that the structure of the model as
\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_K X_K + Other\; Factors\]
allows the omission of a large number of other important factors provided the impact of those factors is independent of the included ones. However, if excluded variables are correlated with the ones that are included in the model, the estimates will be distorted and using the model for prescriptive analytics will be risky, though the model might still be useful for predictions.
2. The error terms each have mean zero, uniform variance and are independent of all other variables. This assumption compounds two assumptions that typically arise in textbooks. We have grouped them together for convenience. Violations of the uniformity of error variance can give rise to a variety of issues. Two common ones are heteroscedasticity and, for time series, autocorrelation, though there are others that you may encounter if you study this area further. Fortunately, heteroscedasticity and autocorrelation can be detected and managed with modern statistical software.
3. The X terms can be treated as fixed in repeated samples. This assumption is a bit of overkill. It is designed to ensure that the randomness of the X-variables cannot lead to problems in estimating the parameters due to a relationship in the randomness among the explanatory and dependent variables. Killing off randomness by assuming the x-variables are fixed will certainly remove the prospect of any relationship, but can be replaced by weaker assumptions about independence.
The key problem here is that you don’t actually get to choose the character of your X-variables, so you have to use what you get. Techniques exist for identifying and dealing with some of the problems that arise from this violation.
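To illustrate how a violation like the heteroscedasticity mentioned in assumption 2 can be detected, here is a sketch of the idea behind the Breusch-Pagan test in base R. The data and the growing error variance are simulated; in practice you would use a packaged implementation such as the one in the lmtest package.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error variance grows with x

fit <- lm(y ~ x)

# Breusch-Pagan idea: regress the squared residuals on the explanatory
# variables; under homoscedasticity, n * R-squared is approximately
# chi-squared with (number of x variables) degrees of freedom.
aux  <- lm(residuals(fit)^2 ~ x)
stat <- n * summary(aux)$r.squared
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
pval                                # a small p-value flags heteroscedasticity
```
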
11.3.3.3 So how does it work?
Linear regression uses the ordinary least squares (OLS) method to find the parameters that best fit the data, in the sense that it minimizes the sum of the squares of the individual error terms. To calculate a collection of errors, we need to replace each of the beta terms, \(\beta_0 ... \beta_K\), with their estimated values; these are typically noted by putting a hat (^) character on the individual beta terms. So the regression equation becomes:
\[Y = \hat{\beta_0} + \hat{\beta_1}X_1 + ... + \hat{\beta_K}X_K + \hat{E}\]
Rearranging this equation and focusing on a single observation, \(i\), we get the \(i\)’th estimated error term (also known as a residual):
\[y_i - (\hat{\beta_0}+\hat{\beta_1}X_{i1} + ... + \hat{\beta_K}X_{iK}) = y_i - \hat{y_i} = \hat{e_i}\]
Graphically, this value can be depicted as:
Figure 11.10: Linear Regression with Error Calculations
Since all the estimated values lie along the regression line, you can see that they have less variability than the original values do. This turns out to be the basis for one of the important measures of fit for a regression model, \(R^2\), which measures the share of the original \(y\)’s variability or more technically, its variance that is explained by the variance in the model. In some sense, a perfect model would explain all of the variability in \(y\).
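The relationship between residual variability and \(R^2\) can be verified directly. This is a sketch on simulated data with invented coefficients:

```r
set.seed(8)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

# R-squared is the share of y's variance explained by the fitted values:
r2_manual <- var(fitted(fit)) / var(y)

# Equivalently, one minus the unexplained share:
r2_alt <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

c(r2_manual, r2_alt, summary(fit)$r.squared)   # all three agree
```
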
Because of the importance of explaining the variation in outcomes, analysts often over-focus on the role of variance explained in selecting the best regression model to use. Since adding x-variables to any model necessarily increases \(R^2\) by some amount, an over-focus on \(R^2\) tends to create larger and larger models where variables with no theoretic and minimal statistical impact are included in the model. This approach, occasionally described as ‘chasing \(R^2\)’, invariably leads to the creation of unreliable models. This issue will be discussed further in later texts.
Effectively, the OLS method finds the estimated beta terms that minimize the sum of the squares of these individual estimated errors. While this may seem like an arbitrary choice, it is actually supported by statistical theory.
We need not go into the justification in this course beyond saying that there are very good reasons behind this choice when the assumptions of the regression model hold. The actual production of regression estimates is done by statistical software. Aside from undergraduate students in classes being taught by mildly psychotic econometrics professors, no one ever calculates a regression estimate by hand. For example, in R we can build the simple linear model:
\[Grocery\; Bill = \beta_0 + \beta_1 Family\_Income \]
with the following code and the grocery store data:
regression <- lm(grocery$Grocery_Bill ~ grocery$Family_Income, data = grocery)
regression
##
## Call:
## lm(formula = grocery$Grocery_Bill ~ grocery$Family_Income, data = grocery)
##
## Coefficients:
## (Intercept) grocery$Family_Income
## -22.653283 0.002577
As discussed earlier, the estimated Beta terms can be calculated using matrix algebra:
y <- grocery$Grocery_Bill
X <- grocery$Family_Income
int <- rep(1, length(y))
X <- cbind(int, X)
betas <- solve(t(X) %*% X) %*% t(X) %*% y
betas
## [,1]
## int -22.653283205
## X 0.002577417
And a simple linear model is no different from the line of best fit on a scatterplot:
11.3.4 The Role of Data in Regression
11.3.4.1 Quantity and Scope (n and k) in Econometric Modelling
The quality of analytics models is ultimately limited by the quantity and quality of data available to the modeler. For regression modelling, we can think of the data as having two important dimensions, \(n\) and \(k\). \(n\) refers to the number of observations on each of the data elements, which is to say the size of the sample. \(k\) refers to the number of distinct data elements being measured for each element in the sample, which is to say the scope of the data.
Generally speaking, the larger the sample, \(n\), the better. For practical purposes, no one would trust a regression model with fewer than about 30 observations per independent variable used.
As sample size increases beyond that minimum, two things occur: your estimated results become more accurate and the model becomes less sensitive to certain assumptions.
The accuracy of the model tends to grow because, as we saw with estimates for sample averages, the standard error of the estimated sample mean tends to fall in proportion to \(\frac{1}{\sqrt{n}}\). Something similar happens with regression estimates of the Beta terms – which are in some sense just conditional averages. This means that as the sample increases in size, the estimates become more accurate, but at a decreasing rate: the first 500 observations are much more valuable to you than the second 500 observations.
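A quick simulation sketch (invented model and coefficients) makes the \(\frac{1}{\sqrt{n}}\) behaviour visible: each time the sample quadruples, the estimated slope's standard error roughly halves.

```r
set.seed(9)
slope_se <- function(n) {
  x <- runif(n, 0, 10)
  y <- 1 + 2 * x + rnorm(n, sd = 5)
  # Extract the estimated standard error of the slope:
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

ses <- sapply(c(100, 400, 1600), slope_se)
ses      # each standard error is roughly half the one before it
```
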
In terms of our conceptual model for regression, where we say:
\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]
Increasing the sample size improves the estimate of the ‘average of your class’ part. It does nothing to address the ‘individual variation’ part. So increasing sample size improves the prediction for an average member of a group, but not an individual member.
Beyond the improvement in accuracy, increasing sample size can reduce the sensitivity of the model to its assumptions. To explain why, we need to recognize that there are two theoretic categories for sample size: standard regression modelling and asymptotic theory.
Standard regression theory requires stricter assumptions about the nature of the error term. It generally speaks to predictions that are unbiased if the OLS assumptions hold as described above. However, one additional assumption is typically very beneficial here – that the error terms have a normal distribution. This assumption was not listed in the set of assumptions above because many of the results of interest hold without it and it is not generally required for asymptotic theory.
Asymptotic theory is used when the samples are sufficiently large as to be treated as though they were ‘approaching infinity’. While that may seem like it should be a very large number, infinity, in this sense at least, is not what it used to be – generally a couple thousand observations should do it. When dealing with asymptotic theory, we think of the parameter estimates as converging on the true values if the assumptions hold true. Under asymptotic theory, the residuals can be treated as asymptotically normal, which allows for effective hypothesis testing even if the residuals are not normal. The bottom line here is that your first consideration of sample size is to get a couple thousand observations to ensure that asymptotic theory applies. Then you want to increase sample size, if required, to ensure that your estimates are sufficiently accurate to serve your purposes.
In terms of sample scope, \(k\), the more variables with a credible claim to belonging in the model, the better. In terms of our conceptual model for regression:
\[Your\; Earnings = Average\; of\; your\; Class + Your\; Individual\; Variation\]
Having more variables that belong in the model allows you to better characterize the class to which an individual belongs and therefore reduces the individual’s variation relative to that class. In short, more variables that belong in the model means less residual variation, means better estimates. The trick here is to avoid pulling variables into the model when they do not belong.
This section was largely informed by Kennedy’s text Kennedy (2008).
Chapter 12 Appendix
12.1 Installing Excel’s Analysis Toolpak
To install the Analysis Toolpak in Excel 2010, 2013, 2016, 2019, and Office 365, click File > Options. The Excel Options dialog box will open.
Figure 12.1: Add-ins Tab
Make sure Excel Add-ins is selected next to ‘Manage’, and click ‘Go…’. A dialog box will open.
Figure 12.2: Add in selections
Check the Analysis Toolpak and click OK. It is now installed! You can access the Analysis Toolpak under the data tab on the ribbon.
12.2 More technical reading
Glossary
| Term | Definition |
|---|---|
| AER (Average Econometric Regression) | A modelling philosophy principled in theory, suggesting that an inaccurate model requires more complex modelling techniques as opposed to a different specification |
| Alpha | The probability threshold used in hypothesis testing. If the p-value calculated in the hypothesis testing process is below alpha, the null hypothesis is rejected in favour of the alternative. Alpha is also the conditional probability that the null hypothesis would be rejected if the null hypothesis were true at the equality bound. |
| Analytics | A problem solving tool; any time one uses data to provide insight into a business or problem, or inform action. |
| ARIMA (Autoregressive Integrated Moving Average) | AutoRegressive Integrated Moving Average, which is a class of model that captures a suite of different standard temporal structures in time series data. |
| Artificial Intelligence | Intelligence displayed by machines. |
| Autocorrelation | The correlation of a variable with a delayed copy of itself to assess if a correlation in behaviour exists. For example, this can be used in assessing stock prices. |
| Bayesian Theorem | In Bayesian statistics, probabilities are treated as a measure of belief based on both prior information and sample information. Parameters are effectively treated as distributions instead of unknown but fixed values. |
| Beta | The probability of a type 2 error. Unlike alpha, beta is actually a function that decreases with the sample size and the effect size and increases with the standard deviation. Power is 1 – beta. |
| Big Data | A term that refers to data sets which are too large or complex to be analyzed by traditional data processing applications. |
| Binomial Random Variable | A random variable for the number of successes given a fixed number of trials. See text for technical details. |
| Bucket / Bin | In this context, a bucket (or bin) refers to a category defined by an upper and lower bound that allows counting the number of observations that fall in a particular category. The only trick to buckets is ensuring the boundaries do not overlap. In Excel, this is done by counting a value in a bucket if it is larger than the lower bound and less than or equal to the upper bound. This prevents both double counting and missing values. |
| Chi Distribution | A continuous probability distribution, often used for evaluating proportions. |
| Coefficient (Beta) | A measure to compare the strength of impact each independent variable has on a dependent variable. |
| Collectively Exhaustive Events | Collectively exhaustive events are sets of events that completely exhaust all possible outcomes of an experiment. |
| Collinearity | A condition where some independent variables are highly correlated. |
| Complement | The complement of an event is the event that makes up the rest of the sample space. If the experiment is rolling a single six-sided die, and the event of interest is rolling an even number, then the complement is rolling an odd number. |
| Continuous Random Variable | A random variable that can take on any value over at least some range of values. They have an uncountable infinite number of outcomes, each of which have 0 probability of occurring. This means that, for continuous random variables, probabilities are only defined over ranges of possible outcomes. We think of the probability associated with an individual outcome as a density of probability. While it is beyond the scope of this course, using calculus to integrate over a range of outcomes, we can calculate the probability of that range of outcomes. Integrating over all possible outcomes gives us a probability of one. |
| Conviction | The belief you have in the truth of the alternative hypothesis, given that the null has been rejected. Conviction appears to be an under-appreciated aspect of statistical testing, at least at the introductory level. The issue is that one may reject the null because it is false or because it is true and one commits a type 1 error. The relative probabilities of the null being true, the commission of a type 1 error, and the power of the test can, in a Bayesian view, inform how much conviction one should have in the test’s results. |
| Cost of Failure | The net cost incurred by an organization when it undertakes an action that subsequently fails. In statistics, this tends to be associated with a type 1 error since a type 1 error results in falsely rejecting the null and therefore engaging in an action that is not justified and will ultimately fail. |
| Cross-sectional Data | This is data collected on individuals (e.g. people, processes, things) that does not depend on a time dimension amongst the observations. For example, I could take a sample of 20 students in the class and ask them what time they got up that morning. This data would have something to do with time, but it would still be cross sectional because there is no time dimension relating the individual observations. Any valid analysis on cross-sectional data can be done and will produce the same results regardless of how the data is sequenced. |
| Data Science | A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. |
| Dependent Variable | The output or outcome whose variation is being studied. |
| Discrete Random Variable | A random variable that can take on only specific values, not ranges. The number of possible outcomes can be finite or infinite, but they are countable in the technical sense of the word. For discrete random variables each outcome has a non-zero probability, which can be thought of as the mass of probability associated with that outcome. The sum of probabilities over all outcomes adds to one. |
| Dummy Variable | A numerical variable used in regression analysis to represent subgroups of the sample in a study. For example, it can be applied to differentiate Canadians vs. Americans. Also known as an indicator. |
| Econometrics | The application of statistical methods to economic data. |
| Elastic Net | A combination of Ridge and Lasso regression. |
| Elementary Events | An elementary event is an event that cannot be decomposed into other events |
| Events | Events are collections of one or more possible outcomes. Literally, they are the things that can happen. |
| Experiment | An experiment in this context is any process that can yield an uncertain result from a well-defined collection of outcomes (events). |
| F Distribution | A right-skewed distribution used in analysis of variance. It is used to compare statistical models that have been fitted to a data set to identify the model that best fits the population. |
| False Negative | A false negative occurs when a test concludes the person does not have a disease when in fact they do. This is also known as a Type 2 error in hypothesis testing where the null hypothesis is not rejected when it should have been. |
| False Positive | A false positive occurs when a test concludes the person has a disease when in fact they do not. This is also known as a Type 1 error in hypothesis testing, where the null hypothesis is rejected when it should not be. |
| Feature | A distinctive attribute. |
| GLS | Generalized Least Squares, a technique for estimating the unknown parameters in a linear regression model when there is some degree of correlation between the residuals. |
| Heteroscedasticity | A characteristic of data in which the variability of the regression error term is not constant across observations. |
| Hypothesis Testing | A formal method of statistical decision making that weighs the evidence from a sample to determine whether a particular belief and its related course of action are justified. It is based on assuming that a particular data generating process created the sample and then assessing the probability that the process could generate the observed result. |
| Independent Events | Independent events are ones where the occurrence of one event tells you nothing about the probability of the other’s occurrence. |
| Independent Variable | The input which is assessed for potentially causing a variation in an output. |
| Interaction Dummy | A classic dummy variable (or indicator) multiplied by a continuous variable. Allows for slope changes between groups in a regression. |
| Intersection | The intersection of two sets is the collection of elements found in both sets. It is typically denoted by the symbol ∩ or the word “and”. |
| Language of Probability | As a formal mathematical discipline, probability has its own well-defined language to describe such things as sets, events, and the relations between them. Understanding this language is important for the formal aspects of probability because English and other ‘natural’ languages lack the precision required for mathematics. |
| Lasso Regression | Least Absolute Shrinkage and Selection Operator. A type of linear regression that penalizes the absolute size of coefficients, shrinking some of them exactly to zero and thereby performing variable selection. |
| Linear Regression | A linear approach to modeling the relationship between independent variables and a dependent variable. |
| Modeling | Modeling in this context refers to developing a mathematical model that describes or predicts various outcomes in a business context. For example, one could use a probit or logit model to understand the probability that a customer will purchase a product at a given price given some characteristics of the marketplace and customer demographics. Such a model could be used as an input to select the most profitable price or mix of advertising channels. |
| Mutually Exclusive Events | Mutually exclusive events are events that cannot both occur simultaneously. |
| Non-Normal Error Term | Any time the error term of your regression does not follow a normal distribution. This violates an underlying assumption of OLS. |
| Normal Distribution | The normal distribution is a bell-shaped continuous distribution that commonly occurs in statistics. The distribution is characterized by its mean and standard deviation. Its relevance in statistics arises because of the CLT which suggests that many averaging and summing processes produce normal distributions. |
| Number Line | A number line is a graphical representation of numbers laid out along a line. Number lines are very useful for organizing one’s thinking with respect to events, complements, and probability statements. |
| OLS | Ordinary Least Squares is a type of linear least squares method for estimating the unknown parameters in a linear regression model. |
| One Tailed Test / Alternative | The situation where the alternative hypothesis is framed in such a way that only evidence in one direction, either large or small, can lead to the rejection of the null hypothesis. In this case, the p-value is calculated only in the one tail that lies in the ‘direction of rejection’. Such an alternative might take the form: H1: Incomes are high enough to support more expensive houses => H1: Mu_Income > 100,000. In this case, only a sample with high incomes could lead to rejecting the null. |
| Opportunity Cost | The value of the lost opportunity associated with any action that is taken. Here it typically means the value that is lost when a hypothesis test yields a type 2 error, leading the decision maker not to engage in what would be a valuable action. Opportunity cost tends to be positively related to effect size, which means that the decisions that would have higher opportunity costs tend to be easier to detect in a statistical test. |
| Ordinal Data | Data organized in accordance with a priority category, in a scaled manner. For example, satisfaction ratings. |
| Panel Data | Panel data combines cross-sectional with time series data by tracking several different things over time. For example, the daily closing price of 10 individual stocks in a portfolio tracked over a one-year period would be panel data. Any cross-sectional analysis would not depend on the sequencing of the cross-sectional elements (e.g. you could sort the stocks alphabetically or randomly) but the time dimension cannot be re-sequenced. Generally, panel data introduces additional options for analysis beyond those available with just time series data. |
| Poisson Random Variable | A discrete random variable that describes the number of times an event occurs over an interval during which it occurs at a well-described average rate. See the textbook for details. |
| Populations | A population is the complete set of things that we care about. For example, we could care about all the current residents of Berlin, which is a population of a finite size. The characteristics of populations are called parameters. |
| Power | The ability to reject a null hypothesis when the null is false. Power is equal to 1 – Beta, so power increases with sample size and effect size and decreases with standard deviation. Power can also be increased by increasing alpha, though this is generally not a good trade-off. |
| Probability Density Function (PDF) | These are mathematical functions that associate probabilities with continuous random variables. Because the individual outcomes of continuous random variables have 0 probability, the probabilities of events described by PDFs cannot simply be added up but in principle must be integrated using calculus. In practice, managers seldom work with PDFs directly, relying instead on software to calculate the values. |
| Probability Mass Function (PMF) | These are mathematical formulas that define the mass of probability associated with specific outcomes in discrete distributions. We will seldom encounter them directly as they are typically built into software such as Excel’s POISSON.DIST and BINOM.DIST functions. The masses over all possible outcomes sum to one. |
| P-Value | The probability that a data set ‘at least as extreme as the one just observed’ would be generated by the data generating process described by the null hypothesis. There is a world of difference between rejecting a null with p-value = 0.049 vs. rejecting that same null with p-value 0.0000001, so p-values should be available for any consumer of a statistical test. |
| Random Events | Random events correspond to situations where there is uncertainty about what will happen. Often, we will model random events by describing a set of outcomes that could occur and then use probability theory or other models to answer questions such as how likely a particular event is, what the most likely outcome is, etc. |
| Random Variables | A random variable is a function that links a random event to a number. So, if the experiment is flipping a coin, the sample space may be S = {Heads, Tails} and the random variable could assign 1 to heads and 0 to tails. |
| Ridge Regression | A type of linear regression aimed at decreasing model complexity without removing variables from the model. Ridge regression shrinks variable coefficients toward 0 without setting them exactly to zero. |
| Sample | A sample is (generally) a subset of a population. Samples can be constructed with replacement (the same item can be included in the sample multiple times) or without replacement (an item can be included in the sample at most once.) Proper construction of a sample is a very important, but often underappreciated aspect of analytics. |
| Sample Space | A sample space is the collection of possible outcomes or events associated with an experiment. |
| Sets | A set is any collection of objects. Typically, this is used here to describe sets of events or sets of numbers. For example, the set of possible outcomes of rolling a single, normal die is getting a 1, 2, 3, 4, 5, or 6. This might be written in set notation as S = {1, 2, 3, 4, 5, 6} which is a set of numbers. We might also describe a set of events such as S = {Rolling an even number, Rolling a number less than 4}. |
| Standard Normal | The standard normal distribution is a normal distribution whose mean is 0 and standard deviation is 1. The standard normal was particularly relevant in statistics before computers made it easy to calculate probabilities associated with any normal distribution. It remains important because of its association with the t-distribution as well as its use in many standard models. |
| Success | The label given to the outcome whose number of occurrences is being counted in the Binomial distribution. |
| T-distribution | The t-distribution is visibly similar to the standard normal distribution except that it is a bit shorter and wider. T-distributions arise in situations where a random variable would have a standard normal distribution if its standard deviation were known, for example as a result of the CLT, but the standard deviation is estimated rather than known. T-distributions are characterized by degrees of freedom, which at this point in the text are based on n-1. As n approaches infinity, the t-distribution converges to the standard normal. |
| Time Dimension | For some data, such as time series and panel data, each observation has an indication of when it occurred. For example, the closing price for a particular stock occurs once per day and that time dimension is critical to understanding the data. |
| Time Series Data | Time series data pertains to data where the individual observations can only be understood by considering the temporal relationship between them. For example, the closing price of a particular stock on each day is a time series because the data can only be understood in relation to the other observations and in relation to their temporal relationship. The closing price today is more closely related to yesterday’s closing price than the closing price a week ago. While time series data can be analyzed as cross-sectional data only (e.g. what was the average closing price last year) true time series analysis depends on sequencing of the data and would result in very different and typically meaningless analysis if the data is re-sequenced. |
| TTT (Test Test Test) | A modelling philosophy largely grounded in testing variable significance. The philosophy ‘tests down’ from larger models. |
| Two Tailed Test/Alternative | The situation where the alternative hypothesis is framed in such a way that evidence in both directions can contribute to rejecting the null hypothesis. Such an alternative might take the form: H1: Incomes in Quebec are different from those in Ontario so we must change our marketing plan => H1: Mu_Income_Ontario – Mu_Income_Quebec <> 0. In this case, either unusually low or unusually high sample averages could lead to rejecting the null, and the p-value is calculated as the area in the tail multiplied by two. |
| Type 1 Error | The error that occurs when the null hypothesis is rejected even when it is true. In a business situation, this would typically mean that the data suggested an action should be undertaken even though it ultimately failed. As a result, type 1 error is associated with the cost of failure. |
| Type 2 Error | The error that occurs when the null hypothesis is not rejected even though it is false. In a business situation, this would typically mean that evidence was not found to support what would have been a good decision and consequently a promising action was not undertaken. As a result, type 2 errors are associated with opportunity cost. |
| Uniform Distribution | The uniform distribution is possibly the simplest continuous distribution. It is characterized by an upper bound (b) and lower bound (a) wherein every equally sized range has equal probability of occurring. The distribution looks like a box bounded on the left and right by a and b respectively with a height of 1/(b-a). While it is exceptionally convenient, it does not arise frequently in real world problems, though it is occasionally used in situations of maximal ignorance. |
| Z-Score | The Z-score formula, Z = (x – mu) / sigma, is used for finding a point on the standard normal that corresponds to a given point on an arbitrary normal distribution. The point corresponds in the sense that the probability of finding a value below that point is the same. |
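Several of the entries above (Z-score, standard normal, p-value) describe a single computation: standardize a point and look up its probability. The book's worked examples use Excel and R; the sketch below is in Python purely as an illustration, with made-up income figures.

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    """Map a point on a Normal(mu, sigma) distribution to the standard normal."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal, computed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical example: incomes ~ Normal(mean = 100,000, sd = 15,000).
# What fraction of incomes fall below 130,000?
z = z_score(130_000, 100_000, 15_000)   # z = 2.0
p = standard_normal_cdf(z)              # approx. 0.9772
print(round(z, 2), round(p, 4))
```

The same answer comes from Excel's NORM.DIST(130000, 100000, 15000, TRUE) or R's pnorm(130000, 100000, 15000); the Z-score formula simply relocates the question onto the standard normal.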
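The PMF, discrete random variable, and success entries can likewise be made concrete with a short computation. The sketch below is in Python rather than the book's Excel/R, and the coin-flip setup is an illustrative assumption: it evaluates the Binomial mass function directly and checks that the masses over all outcomes sum to one, as a PMF requires.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical example: 10 fair coin flips, 'success' = heads (p = 0.5).
# Each outcome carries a positive mass, and the masses sum to one.
masses = [binomial_pmf(k, 10, 0.5) for k in range(11)]
print(round(masses[5], 4))     # P(exactly 5 heads) = 252/1024
print(round(sum(masses), 10))
```

Excel's BINOM.DIST(5, 10, 0.5, FALSE) and R's dbinom(5, 10, 0.5) return the same mass; software hides the formula, but this is what it computes.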